Often one has tables that mix regular variables (e.g. the size of cells in microscopy images) with categorical variables (e.g. the type of cell to which they belong). In that case, it is quite usual to split the data using the category to do computations. Pandas allows to do this very easily.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's import some data and have a look at them
composers = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet5')
composers.head()
# MZ
# you don't have to explicitly go through the table and groupe elements
# simply use the 'groupby' function
What if we want now to count how many composers we have in each category? In classical computing we would maybe do a for loop to count occurrences. Pandas simplifies this with the groupby()
function, which actually groups elements by a certain criteria, e.g. a categorical variable like the period:
composer_grouped = composers.groupby('period')
composer_grouped
# MZ: create new type of object from Pandas
The output is a bit cryptic. What we actually have is a new object called a group which has a lot of handy properties. First let's see what the groups actually are. As for the Dataframe, let's look at a summary of the object:
composer_grouped.describe()
# MZ: get all the statistics by the groups created
# MZ: for example to see the different levels
composers.country.unique()
So we have a dataframe with a statistical summary of the the contents. The "names" of the groups are here the indices of the Dataframe. These names are simply all the different categories that were present in the column we used for grouping. Now we can recover a single group:
composer_grouped.get_group('baroque')
composer_grouped.get_group('post-romantic')
If one has multiple categorical variables, one can also do a grouping on several levels. For example here we want to classify composers both by period and country. For this we just give two column names to the groupby()
function:
# MZ: groupping can be done on multiple columns
composer_grouped = composers.groupby(['period','country'])
composer_grouped.describe()
composer_grouped.get_group(('baroque','Germany'))
The main advantage of this Group object is that it allows us to do very quickly both computations and plotting without having to loop through different categories. Indeed Pandas makes all the work for us: it applies functions on each group and then reassembles the results into a Dataframe (or Series depending on output).
For example we can apply most functions we used for Dataframes (mean, sum etc.) on groups as well and Pandas seamlessly does the work for us:
composer_grouped.mean()
# MZ: often you can directly apply the functions on the Pandas object
composer_grouped.count()
We can also design specific functions (again, like in the case of Dataframes) and apply them on groups:
def mult(ser):
return ser.max() * 3
composer_grouped.apply(mult)
# MZ: most functions can be applied irrespectively of the object (DataFrame, group, Series, etc.)
Let's have a look again at one of our grouped Dataframe on which we applied some summary function like a mean on the age column:
composers['age']= composers['death']-composers['birth']
composers.groupby(['country','period']).age.mean()
Here we have two level of indices, with the main one being the country which contains all periods. Often for plotting we however need to have the information in another format. In particular we would like each of these values to be one observation in a regular table. For example we could have a country vs period table where all elements are the mean age. To do that we need to unstack our multi-level Dataframe:
# MZ: to obtain regular 2dim object
composer_unstacked = composers.groupby(['country','period']).age.mean().unstack()
composer_unstacked
The possibility to create groups gives us also the opportunity to easily create interesting plots without writing too much code. For example we can caluclate the average age of composers in each period and plot it as a bar plot:
composers.groupby('period')['age'].mean().plot(kind = 'bar')
# MZ: group by period and plot the mean of the ages
We can also use our unstacked table of country vs. period to automatically plot all average ages split by country and period:
composer_unstacked.plot(kind = 'bar');
# average age for each country and each period
There are much more powerful ways of using grouping-like features for plotting using the ggplot type grammar of graphics where objects can be grouped within an "aeasthetic". In the example above the "colour aesthetic" would e.g. be assigned to the period variable. Such an approach removes the need to do explicit groupings as done here.