Several more advanced packages rely on Pandas structures to work. One example is scikit-learn, which has become one of the dominant machine learning libraries in data science. We are now going to take a very quick look at how Pandas is used in that context.
We will work again with our Swiss towns dataset and see whether we can predict the result of a party based on that information.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We first load the data set:
towns = pd.read_excel('Datasets/2018.xls', skiprows=list(range(5))+list(range(6,9)),
                      skipfooter=34, index_col='Commune', na_values=['*', 'X'])
towns = towns.reset_index()
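Before going further, a quick sanity check (a minimal sketch; the exact numbers depend on the Excel file) confirms that the import worked:
# Inspect the imported table: size, first column names, and missing cells
print(towns.shape)
print(towns.columns.tolist()[:10])
print(towns.isna().sum().sum())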
Now we have to select which features we are going to use to predict the vote for the UDC. We have to remove the results of the other parties, as those are of course correlated.
We also create a target by selecting only the UDC column:
features = towns.drop(columns=['PDC', 'PS', 'PVL', 'PLR 2)', 'PBD', 'PST/Sol.',
                               'PEV/PCS', 'PES', 'Petits partis de droite',
                               'Commune', 'Code commune'])
features = features.dropna()
targets = features['UDC']
features = features.drop(columns='UDC')
features.head()
targets.head()
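As a small consistency check, the feature matrix and the target vector should have the same number of rows:
# Rows in features must match the number of targets
print(features.shape, targets.shape)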
We need to be able to test whether our ML algorithm is capable of making predictions on data it has not been trained on. We therefore split our dataset into a training and a testing set. Luckily sklearn provides this out of the box if we pass the right dataframes.
from sklearn.model_selection import train_test_split
X, X_test, y, y_test = train_test_split(features, targets,
                                        test_size=0.2,
                                        random_state=42)
# Ratio of test to training set size: 0.2 / 0.8 = 0.25
len(X_test)/len(X)
Sklearn offers a wide range of ML methods. We will not go into details here and simply choose a random forest regression:
from sklearn.ensemble import RandomForestRegressor
Then we instantiate the model and train it (fit):
random_forest = RandomForestRegressor(n_estimators=1000)
random_forest.fit(X, y)
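As a side note, if we want to know how stable the model is across different splits of the data, sklearn also offers cross-validation. The sketch below is one possible way to do it (the 5 folds and the smaller number of trees are arbitrary choices to keep it fast):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training set, scored with negative MAE
scores = cross_val_score(RandomForestRegressor(n_estimators=100),
                         X, y, cv=5, scoring='neg_mean_absolute_error')
print(-scores.mean())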
Finally we can use it to make predictions. In particular we can apply it to our test sample and see how it performs:
predictions = random_forest.predict(X_test)
# Mean absolute error between predictions and true values
mae = np.mean(np.abs(predictions - y_test))
print(mae)
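The same quantity is available in sklearn as a ready-made metric; as a quick check, the following should reproduce the manual computation:
from sklearn.metrics import mean_absolute_error

# Built-in MAE, equal to the mean of absolute errors computed above
print(mean_absolute_error(y_test, predictions))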
sns.scatterplot(x = y_test, y = predictions);
import scipy.stats
scipy.stats.pearsonr(y_test, predictions)
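Another standard summary is the coefficient of determination R², which every sklearn regressor exposes through its score method:
# R² of the model on the held-out test set
print(random_forest.score(X_test, y_test))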
A random forest regressor has the advantage that it can provide us with information about how important each feature is, in other words which features help the most in making predictions:
print(random_forest.feature_importances_)
The larger the number, the greater the feature's predictive power. We can sort this list (np.argsort sorts in ascending order, so the most important features come last) and see which features the values correspond to in our features DataFrame:
features.keys()[np.argsort(random_forest.feature_importances_)]
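For a more readable overview, the importances can be wrapped into a pandas Series indexed by the feature names (a small convenience sketch):
# Pair each importance with its column name and sort, most important first
importances = pd.Series(random_forest.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False).head(10))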
Finally, we can have a look at the actual correlations that seem to be indicated here. For this we select a few features and create a long-format table for plotting:
towns_melt = pd.melt(towns, id_vars=['Commune', 'UDC'],
                     value_vars=['Etrangers en %', 'Surface agricole en %',
                                 'Taux brut de mortalité'])
sns.lmplot(data = towns_melt, x = 'value', y = 'UDC', hue = 'variable', scatter_kws={'alpha' : 0.1});
There are indeed strong correlations where they are expected! Notice also that we can learn things here: the right-wing party is most successful where there are the fewest foreigners...
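To put numbers on these trends, we can also compute the direct correlation of each plotted feature with the UDC result (a quick check using the same columns as in the melt above; corr ignores missing values pairwise):
# Pearson correlation of each selected feature with the UDC result
cols = ['Etrangers en %', 'Surface agricole en %', 'Taux brut de mortalité']
print(towns[cols + ['UDC']].corr()['UDC'])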