import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We have seen now most of the basic features of Pandas including importing data, combining dataframes, aggregating information and plotting it. In this chapter, we are going to re-use these concepts with the real data seen in the introduction chapter. We are also going to explore some more advanced plotting libraries that exploit to the maximum dataframe structures.
We are importing here two tables provided openly by the Swiss National Science Foundation. One contains a list of all projects to which funds have been allocated since 1975. The other table contains a list of all people to which funds have been awarded during the same period:
# local import
projects = pd.read_csv('Data/P3_GrantExport.csv',sep = ';')
persons = pd.read_csv('Data/P3_PersonExport.csv',sep = ';')
# import from url
#projects = pd.read_csv('http://p3.snf.ch/P3Export/P3_GrantExport.csv',sep = ';')
#persons = pd.read_csv('http://p3.snf.ch/P3Export/P3_PersonExport.csv',sep = ';')
We can have a brief look at both tables:
projects.head(5)
persons.head(5)
We see that the persons
table gives information such as the role of a person in various projects (applicant, employee etc.), her/his gender etc. The project table on the other side gives information such as the period of a grant, how much money was awarded etc.
What if we now wish to know for example:
We need a way to link the two tables, i.e. create a large table where each row corresponds to a single observation containing information from the two tables such as: applicant, gender, awarded funds, dates etc. We will now go through all necessary steps to achieve that goal.
If each row of the persons table contained a single observation with a single person and a single project (the same person would appear of course multiple times), we could just join the two tables based e.g. on the project ID. Unfortunately, in the persons table, each line corresponds to a single researcher with all projects IDs lumped together in a list. For example:
persons.iloc[10041]
persons.iloc[10041]['Projects as responsible Applicant']
Therefore the first thing we need to do is to split those strings into actual lists. We can do that by using classic Python string splitting. We simply apply
that function to the relevant columns. We need to take care of rows containing NaNs on which we cannot use split()
. We create two series, one for applicants, one for employees:
projID_a = persons['Projects as responsible Applicant'].apply(lambda x: x.split(';') if not pd.isna(x) else np.nan)
projID_e = persons['Projects as Employee'].apply(lambda x: x.split(';') if not pd.isna(x) else np.nan)
projID_a
projID_a[10041]
Now, to avoid problems later we'll only keep rows that are not NaNs. We first add the two series to the dataframe and then remove NaNs:
pd.isna(projID_a)
applicants = persons.copy()
applicants['projID'] = projID_a
applicants = applicants[~pd.isna(projID_a)]
employees = persons.copy()
employees['projID'] = projID_e
employees = employees[~pd.isna(projID_e)]
Now we want each of these projects to become a single line in the dataframe. Here we use a function that we haven't used before called explode
which turns every element in a list into a row (a good illustration of the variety of available functions in Pandas):
applicants = applicants.explode('projID')
employees = employees.explode('projID')
applicants.head(5)
So now we have one large table, where each row corresponds to a single applicant and a single project. We can finally do our merging operation where we combined information on persons and projects. We will do two such operations: one for applicants using the projID_a
column for merging and one using the projID_e
column. We have one last problem to fix:
applicants.loc[1].projID
projects.loc[1]['Project Number']
We need the project ID in the persons table to be a number and not a string. We can try to convert but get an error:
applicants.projID = applicants.projID.astype(int)
employees.projID = employees.projID.astype(int)
It looks like we have a row that doesn't conform to expectation and only contains ''. Let's try to figure out what happened. First we find the location with the issue:
applicants[applicants.projID=='']
Then we look in the original table:
persons.loc[50947]
Unfortunately, as is often the case, we have a misformatting in the original table. The project as applicant entry has a single number but still contains the ;
sign. Therefore when we split the text, we end up with ['8','']
. Can we fix this? We can for example filter the table and remove rows where projID
has length 0:
applicants = applicants[applicants.projID.apply(lambda x: len(x) > 0)]
employees = employees[employees.projID.apply(lambda x: len(x) > 0)]
Now we can convert the projID
column to integer:
applicants.projID = applicants.projID.astype(int)
employees.projID = employees.projID.astype(int)
Finally we can use merge
to combine both tables. We will combine the projects (on 'Project Number') and persons table (on 'projID_a' and 'projID_e'):
merged_appl = pd.merge(applicants, projects, left_on='projID', right_on='Project Number')
merged_empl = pd.merge(employees, projects, left_on='projID', right_on='Project Number')
applicants.head(5)
We now have in those tables information on both scientists and projects. Among other things we now when each project of each scientist has started via the Start Date
column:
merged_empl['Start Date']
If we want to do computations with dates (e.g. measuring time spans) we have to change the type of the column. Currently it is indeed just a string. We could parse that string, but Pandas already offers tools to handle dates. For example we can use pd.to_datetime
to transform the string into a Python datetime
format. Let's create a new date
column:
merged_empl['date'] = pd.to_datetime(merged_empl['Start Date'])
merged_appl['date'] = pd.to_datetime(merged_appl['Start Date'])
merged_empl.iloc[0]['date']
merged_empl.iloc[0]['date'].year
Let's add a year column to our dataframe:
merged_empl['year'] = merged_empl.date.apply(lambda x: x.year)
merged_appl['year'] = merged_appl.date.apply(lambda x: x.year)
As we did in the introduction, we want to be able to broadly classify projects into three categories. We therefore search for a specific string ('Humanities', 'Mathematics','Biology') within the 'Discipline Name Hierarchy' column to create a new column called 'Field'^:
science_types = ['Humanities', 'Mathematics','Biology']
merged_appl['Field'] = merged_appl['Discipline Name Hierarchy'].apply(
lambda el: next((y for y in [x for x in science_types if x in el] if y is not None),None) if not pd.isna(el) else el)
We will use the amounts awarded in our analysis. Let's look at that column:
merged_appl['Approved Amount']
Problem: we have rows that are not numerical. Let's coerce that column to numerical:
merged_appl['Approved Amount'] = pd.to_numeric(merged_appl['Approved Amount'], errors='coerce')
merged_appl['Approved Amount']
We are finally done tidying up our tables so that we can do proper data analysis. We can aggregate data to answer some questions.
Let's see for example what is the average amount awarded every year, split by gender. We keep only the 'Project funding' category to avoid obscuring the results with large funds awarded for specific projects (PNR etc):
merged_projects = merged_appl[merged_appl['Funding Instrument Hierarchy'] == 'Project funding']
grouped_gender = merged_projects.groupby(['Gender','year'])['Approved Amount'].mean().reset_index()
grouped_gender
To generate a plot, we use here Seaborn which uses some elements of a grammar of graphics. For example we can assign variables to each "aspect" of our plot. Here x and y axis are year and amount while color ('hue') is the gender. In one line, we can generate a plot that compiles all the information:
sns.lineplot(data = grouped_gender, x='year', y='Approved Amount', hue='Gender')
There seems to be a small but systematic difference in the average amount awarded.
We can now use a plotting library that is essentially a Python port of ggplot to add even more complexity to this plot. For example, let's split the data also by Field:
import plotnine as p9
grouped_gender_field = merged_projects.groupby(['Gender','year','Field'])['Approved Amount'].mean().reset_index()
grouped_gender_field
(p9.ggplot(grouped_gender_field, p9.aes('year', 'Approved Amount', color='Gender'))
+ p9.geom_point()
+ p9.geom_line()
+ p9.facet_wrap('~Field'))
One of the questions we wanted to answer above was how much time goes by between the first time a scientist is mentioned as "employee" on an application and the first time he applies as main applicant. We have therefore to:
For (1) we can use groupby
and use the Person ID SNSF
ID which is a unique ID assigned to each researcher. Once this aggregation is done, we can summarize each group by looking for the "minimal" date:
first_empl = merged_empl.groupby('Person ID SNSF').date.min().reset_index()
first_appl = merged_appl.groupby('Person ID SNSF').date.min().reset_index()
We have now two dataframes indexed by the Person ID
:
first_empl.head(5)
Now we can again merge the two series to be able to compare applicant/employee start dates for single people:
merge_first = pd.merge(first_appl, first_empl, on = 'Person ID SNSF', suffixes=('_appl', '_empl'))
merge_first
Finally we merge with the full table, based on the index to recover the other paramters:
full_table = pd.merge(merge_first, merged_appl,on = 'Person ID SNSF')
Finally we can add a column to that dataframe as a "difference in dates":
full_table['time_diff'] = full_table.date_appl-full_table.date_empl
full_table.time_diff = full_table.time_diff.apply(lambda x: x.days/365)
full_table.hist(column='time_diff',bins = 50)
We see that we have one strong peak at $\Delta T==0$ which corresponds to people who were paid for the first time through an SNSF grant when they applied themselves. The remaining cases have a peak around $\Delta T==5$ which typically corresponds to the case where a PhD student was payed on a grant and then applied for a postdoc grant ~4-5 years later.
We can go further and ask how dependent this waiting time is on the Field of research. Obviously Humanities are structured very differently
sns.boxplot(data=full_table, y='time_diff', x='Field');
sns.violinplot(data=full_table, y='time_diff', x='Field', );