import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import ggplot, aes
import plotnine as pn
Why should we use Pandas instad of just opening our table in Excel? While Excel is practical to browse through data, it is very cumbersome to use it to combine, re-arrange and thoroughly analyze data: code is hidden and difficult to share, there's no version control, it's difficult to automate tasks, the manual clicking around leads to mistakes etc.
In this course you will learn how to handle tabular data with Pandas, a Python package widely used in the scientific and data science areas. You will learn how to create and import tables, how to combine them, modify them, do statistical analysis on them and finally how to use them to easily create complex visualisations.
So that you see where this leads, we start with a short example of how Pandas can be used in a project. We look here at tables provided by the Federal Office of Statistics about all Swiss towns every year (population, number of vacant houses, unemployement etc.)
towns = pd.read_excel('Datasets/2013.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*'])
towns.head()
ax = towns.groupby(pd.cut(towns['Habitants'], np.arange(0, 100000, 10000)),as_index=False).mean().dropna().plot(x = 'Habitants', y = ['UDC','PSS'])
ax.set_ylabel('% Vote')
plt.show()
Above we imported a single table. However often you need to combine information from multiple sources or multiple experiments. This can be extremely tedious to do in Excel. Here it is done in a few lines:
towns = []
for x in range(2013, 2015):
temp_town = pd.read_excel('Datasets/'+str(x)+'.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*','+'])
temp_town['year'] = x
towns.append(temp_town)
towns_concat = pd.concat(towns,sort=False)
cities = ['Zürich', 'Bern','Lausanne','Basel','Genève']
Plot for each city a parameter you are interested in like the fraction of vacant appartments:
import seaborn as sns
g = sns.lineplot(data = towns_concat.loc[cities].reset_index(), x = 'year', y = 'Taux de logements vacants',
hue = 'Commune', marker = "o", dashes=False);
plt.legend(loc='upper left')
plt.show()
We can also exploit the plotting capabilities of advanced packages such as plotnine, a Python version of ggplot, to create complex plots with little effort. For example here, we show how the voting depends on how much a town depends on agriculture. We separate the data by year as well as py party.
First we just import two years of data (two different parliaments):
towns = []
for x in [2014, 2018]:
temp_town = pd.read_excel('Datasets/'+str(x)+'.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*','+'])
temp_town['year'] = x
towns.append(temp_town)
towns_concat = pd.concat(towns,sort=False)
We recover the necessary information and do some data reshaping (tidying) to be able to easily realize the plot:
towns_parties = towns_concat.reset_index()[['year','UDC','PS','Commune','Surface agricole en %']]
#wide to long
towns_parties = pd.melt(towns_parties, id_vars=['Commune','Surface agricole en %','year'],
value_vars=["UDC", "PS"], value_name='Vote fraction', var_name='Party')
towns_parties.head()
And finally we can plot our data:
sns.lmplot(data = towns_parties.dropna(), x = 'Surface agricole en %',
y = 'Vote fraction', hue = 'year',col = 'Party',
scatter_kws={'alpha' :0.1})
(ggplot(towns_parties.dropna(),aes(x = 'Surface agricole en %', y = 'Vote fraction', color = 'factor(year)', shape = 'Party'))
+ pn.geom_point(alpha = 0.05)
+ pn.geom_smooth(method='lm')
+ pn.stats.stat_summary_bin(fun_data = 'mean_cl_normal', bins = 30)#fun_y = np.mean, fun_ymin=np.var, fun_ymax=np.var, bins = 30)
+ pn.labs(color = 'Year')
+ pn.scale_color_manual(['red','blue'])
).draw();