import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We have seen already two options to plot data: we can use the "raw" Matplotlib which in principle allows one to create any possible plot, however with lots of code, and we saw the simpler internal Pandas solution. While the latter solution is very practical to quickly look through data, it is rather cumbersome to realise more complex plots.
Here we look at another type of plotting resting on the concepts of the grammar of graphics. This approach allows to create complex plots where data can be simply split in a plot into color, shapes etc. without having to do a grouping operation in beforehand. We will mainly look at Seaborn, and finish with an example with Plotnine, the port to Python of ggplot.
We come back here to the dataset of swiss towns. To make the dataset more interestig we add to it some categorical data. First we attempt to add the main language for each town. It is a good example of the type of data wranglig one ofen has to do by combining information from different sources.
#load table indicating to which canton each town belongs
cantons = pd.read_excel('Datasets/be-b-00.04-osv-01.xls',sheet_name=1)[['KTKZ','ORTNAME']]
#load general table with infos on towns
towns = pd.read_excel('Datasets/2018.xls', skiprows=list(range(5))+list(range(6,9)),
skipfooter=34, index_col='Commune',na_values=['*','X'])
towns = towns.reset_index()
#merge tables using the town name. This adds the canton abbreviation to the main table
towns_canton = pd.merge(towns, cantons, left_on='Commune', right_on='ORTNAME',how = 'inner')
#load data indicating languages of each canton
language = pd.read_excel('Datasets/je-f-01.08.01.02.xlsx',skiprows=[0,2,3,4],skipfooter=11)
languages = language[['Allemand (ou suisse allemand)','Français (ou patois romand)',
'Italien (ou dialecte tessinois/italien des grisons)']]
languages = languages.apply(pd.to_numeric, errors='coerce')
#check which language has majority in each canton
languages['language'] = np.argmax(languages.values.astype(float),axis=1)
code={0:'German', 1:'French', 2:'Italian'}
languages['Language'] = languages.language.apply(lambda x: code[x])
languages['canton'] = language['Unnamed: 0']
languages = languages[['canton','Language']]
#load table matching canton name to abbreviation
cantons_abbrev = pd.read_excel('Datasets/cantons_abbrev.xlsx')
#add full canton name to table by merging on abbreviation
canton_language = pd.merge(languages, cantons_abbrev,on='canton')
#add language by merging on canton abbreviation
towns_language = pd.merge(towns_canton, canton_language, left_on='KTKZ', right_on='abbrev')
towns_language['town_type'] = towns_language['Surface agricole en %'].apply(lambda x: 'Land' if x<50 else 'City')
#Create a new party column and a new party score column
parties = pd.melt(towns_language,id_vars=['Commune'], value_vars=['UDC','PS','PDC'],
var_name= 'Party', value_name='Party score')
towns_language = pd.merge(parties, towns_language, on='Commune')
towns_language
We finally have a table with mostly numerical information but also two categorical data: language and town type (land or city). With Seaborn we can now easily make all sorts of plots. For example what are the average scores of the different parties:
sns.barplot(data = towns_language, y='Party score', x = 'Party');
Do land towns vote more for the right-wing party ?
g = sns.scatterplot(data = towns_language, y='UDC', x = 'Surface agricole en %', s = 10, alpha = 0.5);
g.set_xlim([0,100]);
The greate advantage of using these packages is that they allow to include categories as "aesthetics" of the plot. For example we looked before at average party scores. But are they different between language regions ? We can just specify that the hue (color) should be mapped to the town language:
sns.barplot(data = towns_language, y='Party score', x = 'Party', hue = 'Language');
Similarly with scatter plots. Is the relation between land and voting on the right language dependent ?
g = sns.scatterplot(data = towns_language, y='UDC', x = 'Surface agricole en %', hue = 'Language',
s = 10, alpha = 0.5);
g.set_xlim([0,100]);
We see difference in the last plot, but it is still to clearly see the relation. Luckiliy these packages allow us to either create summary statistics or to fit the data:
g = sns.lmplot(data = towns_language, x = 'Surface agricole en %', y='UDC', hue = 'Language', scatter=True,
scatter_kws={'alpha': 0.1});
g.ax.set_xlim([0,100]);
Now we can also do the same exercise for all parties. Does the relation hold?
g = sns.lmplot(data = towns_language, x = 'Surface agricole en %', y='Party score',
hue = 'Party', scatter=True,
scatter_kws={'alpha': 0.1});
g.ax.set_xlim([0,100]);
We can recover from some other place (Poste) the coordinates of each town. Again by merging we can add that information to our main table:
coords = pd.read_csv('Datasets/plz_verzeichnis_v2.csv', sep=';')[['ORTBEZ18','Geokoordinaten']]
coords['lat'] = coords.Geokoordinaten.apply(lambda x: float(x.split(', ')[0]) if type(x)==str else np.nan)
coords['long'] = coords.Geokoordinaten.apply(lambda x: float(x.split(', ')[1]) if type(x)==str else np.nan)
towns_language = pd.merge(towns_language,coords, left_on='Commune', right_on='ORTBEZ18')
So now we can in addition look at the geography of these parameters. For example, who votes for the right-wing party ?
fix, ax = plt.subplots(figsize = (12,8))
sns.scatterplot(data = towns_language, x= 'long', y = 'lat', hue='UDC', style = 'Language', palette='Reds');
# MZ: if used to ggplot -> use 'plotnine' package
# same grammar as ggplot