In the previous chapters, we learned how to use Numpy arrays to perform numerical calculations efficiently. Those arrays are however homogeneous structures, i.e. they can only contain a single type of data. And even when all data are of the same type, the rows and columns of an array carry no labels, which makes it difficult to keep track of what they contain. For such cases we need a structure closer to a table like those found in Excel, and such structures are implemented by the Pandas package.
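To make this concrete, here is a minimal sketch (with made-up values): a Numpy array silently coerces mixed content to a single type, while a Pandas dataframe keeps one type per column and labels both rows and columns.
import numpy as np
import pandas as pd

# mixing numbers and text in a Numpy array coerces everything to strings
np.array([1, 'Bern'])

# a dataframe keeps each column's own type and gives it a label
pd.DataFrame({'amount': [120000, 85000], 'university': ['Bern', 'Basel']})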
But why can't we simply use Excel then? While Excel is practical for browsing through data, it is very cumbersome for combining, re-arranging and thoroughly analyzing data: the logic is hidden and difficult to share, there is no version control, tasks are hard to automate, and manual clicking around leads to mistakes, etc.
In the next chapters, you will learn how to handle tabular data with Pandas, a Python package widely used in science and data science. You will learn how to create and import tables, how to combine, modify and statistically analyze them, and finally how to use them to easily create complex visualizations.
So that you can see where this leads, we start with a short example of how Pandas can be used in a project. We look here at openly available data from the Swiss National Science Foundation about grants awarded since 1975.
import numpy as np
import pandas as pd
import seaborn as sns

# local import
projects = pd.read_csv('Data/P3_GrantExport.csv', sep=';')
# import from url
# projects = pd.read_csv('http://p3.snf.ch/P3Export/P3_GrantExport.csv', sep=';')
We can then have a brief look at the table itself, which Jupyter displays in a formatted way, limiting the view to the first five rows using head():
projects.head(5)
Pandas offers a variety of tools to summarize information about data, and this can be done very efficiently without needing loops, conditionals, etc.
For example, we can quickly count how many times each university appears in the table, using the value_counts() method:
projects['University'].value_counts().head(10)
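As a side note, value_counts() can also return proportions rather than raw counts via its normalize option:
# fraction of all grants per university instead of absolute counts
projects['University'].value_counts(normalize=True).head(10)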
We can then very easily plot the resulting information, either using Pandas directly or a more advanced library like Seaborn, plotnine or Altair.
Here first with plain Pandas (which uses Matplotlib under the hood):
projects['University'].value_counts().head(10).plot(kind='bar')
Unlike Numpy arrays, Pandas can handle a variety of data types in a dataframe. For example, it is very efficient at dealing with dates. Our table contains e.g. a Start Date column stored as text. We can turn those strings into actual dates:
projects['start'] = pd.to_datetime(projects['Start Date'])
projects['year'] = projects.start.apply(lambda x: x.year)
projects.loc[0].start
projects.loc[0].year
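As a side note, the same year extraction can be written without apply() using the vectorized .dt accessor, which is typically faster on large tables:
# equivalent to the apply() above: vectorized access to datetime components
projects['year'] = projects['start'].dt.year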
Pandas is also very efficient at wrangling and aggregating data, i.e. grouping several elements of a table to calculate statistics on them. Here we first need to convert the Approved Amount column to a numeric type. Certain rows contain text (e.g. "not applicable"), so we force the conversion:
projects['Approved Amount'] = pd.to_numeric(projects['Approved Amount'], errors='coerce')
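To illustrate what errors='coerce' does, here is a small standalone example with made-up values: entries that cannot be parsed become NaN instead of raising an error.
# unparseable entries are replaced by NaN: 1000.0, NaN
pd.to_numeric(pd.Series(['1000', 'not applicable']), errors='coerce')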
Then we want to extract the type of field without subfields, e.g. "Humanities" instead of "Humanities and Social Sciences;Theology & religion". For that we can create a custom function and apply it to an entire column:
science_types = ['Humanities', 'Mathematics', 'Biology']
projects['Field'] = projects['Discipline Name Hierarchy'].apply(
    lambda el: next((x for x in science_types if x in el), None) if not pd.isna(el) else el)
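The same logic is perhaps easier to read as a named function (the helper name simplify_field is our own choice); this sketch is equivalent to the lambda above:
def simplify_field(hierarchy):
    # keep missing values unchanged
    if pd.isna(hierarchy):
        return hierarchy
    # return the first known top-level field contained in the string, else None
    for science_type in science_types:
        if science_type in hierarchy:
            return science_type
    return None

projects['Field'] = projects['Discipline Name Hierarchy'].apply(simplify_field)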
Then we group the data by country, year and field, and calculate the mean of each group:
# numeric_only restricts the mean to numeric columns (required in recent Pandas versions)
aggregated = projects.groupby(['Institution Country', 'year', 'Field'], as_index=False).mean(numeric_only=True)
Finally we can use Seaborn to plot the data by "Field", using just keyword arguments to specify what the axes and colours should represent (following principles of the grammar of graphics):
sns.lineplot(data = aggregated, x = 'year', y='Approved Amount', hue='Field');
Note that here, axis labelling, colouring, the legend and the confidence intervals have all been generated automatically based on the content of the dataframe.
We see a drastic increase around 2010: let's have a closer look. We can here again group the data, this time by year and funding type, and calculate the total funding:
grouped = projects.groupby(['year', 'Funding Instrument Hierarchy']).agg(
    total_sum=pd.NamedAgg(column='Approved Amount', aggfunc='sum')).reset_index()
grouped
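An equivalent shorthand, when a single aggregation is enough, is to select the column before summing and name the result via reset_index:
# same table without NamedAgg: select the column, sum, then name the result
grouped = (projects.groupby(['year', 'Funding Instrument Hierarchy'])['Approved Amount']
           .sum()
           .reset_index(name='total_sum'))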
Now, for each year we keep only the 5 largest funding types to be able to plot them:
group_sorted = (grouped.groupby('year', as_index=False)
                .apply(lambda x: x.groupby('Funding Instrument Hierarchy')
                        .sum()
                        .sort_values('total_sum', ascending=False)
                        .head(5))
                .reset_index())
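A more compact alternative for the same selection is nlargest(), which should return the same top rows per year:
# equivalent: directly pick the 5 rows with the largest total_sum per year
group_sorted = (grouped.groupby('year', group_keys=False)
                .apply(lambda x: x.nlargest(5, 'total_sum'))
                .reset_index(drop=True))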
Finally, we keep only the years from 2006 to 2011:
instruments_by_year = group_sorted[(group_sorted.year > 2005) & (group_sorted.year < 2012)]
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
sns.barplot(data=instruments_by_year,
            x='year', y='total_sum', hue='Funding Instrument Hierarchy')
We see that the main change is the sudden increase in funding for national research programs.