Often one has tables that mix regular variables (e.g. the size of cells in microscopy images) with categorical variables (e.g. the type of cell to which they belong). In that case, it is quite usual to split the data by categories or groups to do computations. Pandas allows to do this very easily.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Let's import some data and have a look at them:
composers = pd.read_excel('Data/composers.xlsx', sheet_name='Sheet5')
composers.head()
We also add a column here to calculate the composers' age:
composers['age'] = composers.death - composers.birth
What if we want now to count how many composers we have in a certain category like the period or country? In classical computing we would maybe do a for loop to count occurrences. Pandas simplifies this with the groupby()
function, which actually groups elements by a certain criteria, e.g. a categorical variable like the period:
composer_grouped = composers.groupby('period')
composer_grouped
The output is a bit cryptic. What we actually have is a new object called a group which has a lot of handy properties. First let's see what the groups actually are. We can find all groups with groups
:
composer_grouped.groups
We have a dictionary, where each period that appears in the Dataframe is a key and each key contains a list of dataframe indices of rows with those periods. We will rarely directly use those indices, as most operations on groups only use those "behind the scene".
For example we can use describe()
on a group object, just like we did it before for a Dataframe:
composer_grouped.describe()#.loc['Austria','birth']
We see here that the statistical analysis has been done for each group, the index of each row being the group name (or key in the dictionary). If we are interested in a specific group we can also easily recover it:
composer_grouped.get_group('classic')
We see that this returns a sub-group from the original table. Effectively it is almost equivalent to:
composers[composers.period == 'classic']
If one has multiple categorical variables, one can also do a grouping on several levels. For example here we want to classify composers both by period and country. For this we just give two column names to the groupby()
function:
composer_grouped = composers.groupby(['period','country'])
composer_grouped.describe()
composer_grouped.get_group(('baroque','Germany'))
The main advantage of this Group object is that it allows us to do very quickly both computations and plotting without having to loop through different categories. Indeed Pandas makes all the work for us: it applies functions on each group and then reassembles the results into a Dataframe (or Series depending on the operation).
For example we can apply most functions we used for Dataframes (mean, sum etc.) on groups as well and Pandas seamlessly does the work for us:
composer_grouped.mean()
composer_grouped.count()
We can also design specific functions (again, like in the case of Dataframes) and apply them on groups:
def mult(myseries):
return myseries.max() * 3
composer_grouped.apply(mult)
As we see above, grouping operations can create more or less complex dataframes by adding one or multiple indexing levels. There are multiple ways to "reshape" such dataframes in order to make thm usable e.g. for plotting. Typically, plotting software based on a grammar of graphics expect a simple 2D dataframe where each line is an observation with several properties.
One of the most common "reshaping" is to reset the index. In its simplest form, it will create a new dataframe, where each row corresponds to one observation. For example in the case of a dataframe with multi-indices, it will re-cast these indices as columns:
composer_grouped = composers.groupby(['period','country']).mean()
composer_grouped.head(10)
composer_grouped.reset_index().head(5)
One can of course be more specific and reset only specific indices e.g. by level:
composer_grouped.reset_index(level=1).head(5)
Another way to move indices to columns is to unstack a dataframe, in other words pivot some indices to columns:
composer_grouped.unstack()
This creates a multi-level column indexing.
A very common operation when handling tables is to switch from wide to long format and vice versa. In our composer example, let's for example imagine that you want both birth
and death
dates to be grouped in a single column called dates
. But you still need to know if that data is a birth or date, so you need a new column that indicates that. To achieve that, we need to specify id_vars
a list of columns to be used as identifiers e.g. the composer name, and value_vars
, a list of columns that should become rows:
composers.head(5)
pd.melt(composers, id_vars=['composer'], value_vars=['birth', 'death'])
We can keep more of the original columns as identifiers and also specify names for the variable and value columns:
melted = pd.melt(composers, id_vars=['composer','period','age','country'], value_vars=['birth', 'death'],
var_name = 'date_type', value_name='dates')
melted
We have seen above that we can create groups and apply functions to them to get some summary of them as new dataframes or series that could then also be reshaped. The final result of these operations is then ideally suited to be plotted in a very efficient way.
Here's a simple example: we group composers by periods and then calculate the mean age, resulting in a series where periods are indices:
composers.groupby('period')['age'].mean()
We can just add one more operation to that line to create a bar plot illustrating this:
composers.groupby('period')['age'].mean().plot(kind = 'bar');
The built-in plotting capabilities of Pandas automatically used the indices to label the bars, and also used the series name as a general label.
Using more advanced libraries, we can go further than that and use multiple columns to create complex plots. This will be shown in the next chapter.