import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We have seen in the previous chapter which structures Pandas offers and how to create them. Another very common way of "creating" a Pandas Dataframe is to import a table from another format such as CSV or Excel.
An Excel table containing the same information as in Chapter 1 is provided in composers.xlsx and can be read with the read_excel
function. There are many more readers for other types of data (csv, json, html etc.), but we focus here on Excel.
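As a sketch of one of those other readers, here is read_csv applied to a small inline table (the data is made up for illustration; read_csv accepts file paths as well as file-like objects):

```python
import io
import pandas as pd

# A tiny made-up CSV; read_csv also accepts a file path directly
csv_text = """composer,birth,death
Mahler,1860,1911
Beethoven,1770,1827
"""
df = pd.read_csv(io.StringIO(csv_text))
df
```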
pd.read_excel('Data/composers.xlsx')
The reader automatically recognized the headers of the file. However, it created a new numerical index. If needed, we can specify which column to use as the index:
pd.read_excel('Data/composers.xlsx', index_col = 'composer')
If we open the file in Excel, we see that it is composed of more than one sheet. Clearly, when not specifying anything, the reader only reads the first sheet. However we can specify a sheet:
specific_sheet = pd.read_excel('Data/composers.xlsx', index_col = 'composer',sheet_name='Sheet2')
specific_sheet
For each reader, there is a long list of options specifying how the file should be read. We can see all of them using the help (see below). Imagine that our table contains a title and unnecessary rows: we can use the skiprows
argument. Imagine you have dates in your table: you can use the parse_dates
argument to control how they are interpreted, and so on.
#use shift+tab within the parentheses to see optional arguments
#pd.read_excel()
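As a hedged sketch of the skiprows and date-parsing options on a hypothetical inline file (read_csv takes the same arguments):

```python
import io
import pandas as pd

# Hypothetical file with a title line above the header and a date column
csv_text = """My composer table
composer,birth_date
Mahler,1860-07-07
Beethoven,1770-12-17
"""
# skiprows drops the title line; parse_dates converts the column to datetimes
df = pd.read_csv(io.StringIO(csv_text), skiprows=1, parse_dates=['birth_date'])
df.birth_date.dtype
```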
As you can see above, some information is missing. Some missing values are marked as "unknown" while others are NaN. NaN is the standard symbol for unknown/missing values and is understood by Pandas, while "unknown" is just seen as text. This is impractical: we now have e.g. columns with a mix of numbers and text, which will make later computations difficult. What we would like is to replace all "irrelevant" values with the standard NaN symbol that says "no information".
Let's first do a regular import:
import1 = pd.read_excel('Data/composers.xlsx', index_col = 'composer',sheet_name='Sheet2')
import1
If we now look at individual columns, we can see that they have been imported in different ways. One column has dtype object, i.e. mixed types, while the other contains floats:
import1.birth
import1.death
If we want to do calculations, for example getting summary information using describe()
, we have a problem: the death
column is skipped because no calculation can be done with strings:
import1.describe()
Now we specify that 'unknown' should be a NaN value:
import2 = pd.read_excel('Data/composers.xlsx', index_col = 'composer',
sheet_name='Sheet2', na_values=['unknown'])
import2
And now computations are again possible, as Pandas knows how to deal with NaNs:
import2.describe()
Handling bad or missing values is a very important part of data science. Taking care of the most common occurrences at import is a good solution.
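Beyond fixing values at import, Pandas also offers methods to inspect and handle missing data afterwards. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value
df = pd.DataFrame({'birth': [1860, 1770], 'death': [1911, np.nan]})

mask = df.isna()          # boolean mask of missing entries
complete = df.dropna()    # keep only rows without NaN
filled = df.fillna(0)     # or replace NaNs with a chosen value
```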
We see above that the birth column has been "classified" as a float. However, we know that birth years are simply integers. Here again, we can specify the column type already at import time, using the dtype option and a dictionary:
import2 = pd.read_excel('Data/composers.xlsx', index_col='composer', sheet_name='Sheet1', na_values=['unknown'],
                        dtype={'composer': str, 'birth': np.int32, 'death': np.int32, 'city': str})
import2.birth
Of course we don't have to do all these adjustments at import time. We can also do a default import and correct whatever needs fixing afterwards.
If we missed some bad values at import, we can replace them directly in the Dataframe using the replace()
method, specifying what should be replaced:
import1
import_nans = import1.replace('unknown', np.nan)
import_nans.birth
Note that when we fix "bad" values, e.g. here replacing the "unknown" text values with NaNs, Pandas automatically adjusts the type of the column, allowing us for example to later do mathematical operations.
import1.death.dtype
import_nans.death.dtype
We can also change the type of a column on an existing Dataframe with the same command as in Numpy:
import2.birth
import2.birth.astype('float')
If we look again at import2:
import2.birth
we see that we didn't actually change the type. astype returns a new Series, so the change is only effective if we reassign the column:
import2.birth = import2.birth.astype('float')
import2.birth
You can easily export a Dataframe that you worked on. Most commonly you will export it in a common format like CSV:
import2.to_csv('mydataframe.csv')
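Note that to_csv also writes the index as a regular column, so when reading the file back you typically pass index_col again. A round-trip sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'birth': [1860, 1770]},
                  index=pd.Index(['Mahler', 'Beethoven'], name='composer'))
df.to_csv('mydataframe.csv')

# Read back, restoring the composer column as the index
df2 = pd.read_csv('mydataframe.csv', index_col='composer')
```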
If you have a complex Dataframe that e.g. contains lists, you can save it as a pickle object, a Python-specific format for serializing complex data:
import2.to_pickle('Data/my_dataframe.pkl')
You can reload this type of data via the pickle loading function of Pandas:
import3 = pd.read_pickle('Data/my_dataframe.pkl')
import3
We will learn more about plotting later, but let's see here some possibilities offered by Pandas. Pandas builds on top of Matplotlib but exploits the knowledge included in Dataframes to improve the default output. Let's see with a simple dataset.
composers = pd.read_excel('Data/composers.xlsx', sheet_name='Sheet5')
We can pass Series to Matplotlib which manages to understand them. Here's a default scatter plot:
plt.plot(composers.birth, composers.death, 'o')
plt.show()
Now we look at the default Pandas output. Different types of plots are accessible through the data_frame.plot
function via the kind
option. The variables to plot are passed as column names instead of whole Series as in Matplotlib:
composers.plot(x = 'birth', y = 'death', kind = 'scatter')
plt.show()
We see that the plot automatically gets axis labels. Another gain is that common options like setting a title are directly accessible when creating the plot:
composers.plot(x = 'birth', y = 'death', kind = 'scatter',
title = 'Composer birth and death', grid = True, fontsize = 15)
plt.show()
One can add even more information to the plot with further arguments, used in a way similar to a grammar of graphics. For example, we can color the scatter plot by period:
composers.plot(x = 'birth', y = 'death',kind = 'scatter',
c = composers.period.astype('category').cat.codes, colormap = 'Reds', title = 'Composer birth and death', grid = True, fontsize = 15)
plt.show()
Here you already see a limitation of the plotting library. To color dots by the period category, we had to turn the latter into a series of numbers. We could then rename those to improve the plot, but it is easier to use a more specialized package such as Seaborn, which makes this kind of plot straightforward:
sns.scatterplot(data = composers, x = 'birth', y = 'death', hue = 'period')
plt.show()
Some additional plotting options are available through the plot
accessor. For example histograms:
composers.plot.hist(alpha = 0.5)
plt.show()
Here you see again the gain from using Pandas: without specifying anything, Pandas made a histogram of the two numerical columns, labelled the axes and even added a legend to the plot.
All these features are very nice and very helpful when exploring a dataset. When analyzing data in depth and creating complex plots, Pandas' plotting may however become limiting, and other options such as Seaborn or Plotnine can be used.
Finally, all plots can be "styled" down to the smallest detail, either by using Matplotlib options or by directly applying a style e.g.:
plt.style.use('ggplot')
composers.plot.hist(alpha = 0.5)
plt.show()