7. Pandas objects¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Python has a series of data containers (lists, dicts, etc.) and Numpy offers multi-dimensional arrays, but none of these structures offers a simple way either to handle tabular data or to perform standard database operations. This is why Pandas exists: it offers a complete ecosystem of structures and functions dedicated to handling large tables with inhomogeneous contents.

In this first chapter, we are going to learn about the two main structures of Pandas: Series and Dataframes.

7.1 Series¶

7.1.1 Simple series¶

Series are the Pandas version of 1-D Numpy arrays. We are rarely going to use them directly, but they often appear implicitly when handling data from the more general Dataframe structure. We therefore only cover the basics here.

To understand Series' specificities, let's create one. Usually Pandas structures (Series and Dataframes) are created from other simpler structures like Numpy arrays or dictionaries:

numpy_array = np.array([4,8,38,1,6])

The function pd.Series() allows us to convert objects into Series:

pd_series = pd.Series(numpy_array)
pd_series

0     4
1     8
2    38
3     1
4     6
dtype: int64

The underlying structure can be recovered with the .values attribute:

pd_series.values

array([ 4,  8, 38,  1,  6])
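As a small check of the statement above, the sketch below verifies that the recovered object really is a plain Numpy array. It also uses .to_numpy(), a method that is not part of the original text but that the Pandas documentation recommends as an alternative to .values; the names recovered and recovered_too are only illustrative.

```python
import numpy as np
import pandas as pd

numpy_array = np.array([4, 8, 38, 1, 6])
pd_series = pd.Series(numpy_array)

# Both .values and .to_numpy() give back a plain Numpy array
recovered = pd_series.values
recovered_too = pd_series.to_numpy()

print(type(recovered))
print(np.array_equal(recovered, recovered_too))
```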

Otherwise, indexing works as for regular arrays:

pd_series[1]

8

7.1.2 Indexing¶

On top of accessing values in a series by regular indexing, one can create custom indices for each element in the series:

pd_series2 = pd.Series(numpy_array, index=['a', 'b', 'c', 'd','e'])

pd_series2

a     4
b     8
c    38
d     1
e     6
dtype: int64

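Now a given element can be accessed either by its position (recent versions of Pandas deprecate the plain pd_series2[1] form for a Series with non-integer labels, so we use .iloc[]):

pd_series2.iloc[1]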

8

or its chosen index:

pd_series2['b']

8
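To make the distinction between position and label fully explicit, one can rely on the .iloc[] and .loc[] indexers, which Series share with Dataframes. The following is a minimal sketch that rebuilds the series from above; the variable names on the left-hand side are only illustrative.

```python
import numpy as np
import pandas as pd

pd_series2 = pd.Series(np.array([4, 8, 38, 1, 6]), index=['a', 'b', 'c', 'd', 'e'])

by_position = pd_series2.iloc[1]   # second element, counting from 0
by_label = pd_series2.loc['b']     # element whose index label is 'b'

# Label-based slices include the end label, position-based slices do not
label_slice = pd_series2.loc['b':'d']    # labels b, c and d
position_slice = pd_series2.iloc[1:3]    # positions 1 and 2 (labels b and c)

print(by_position, by_label)
print(label_slice)
print(position_slice)
```

Note the difference in the slices: with labels the end point is included, with positions it is excluded as in regular Python slicing.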

A more direct way to create specific indexes is to transform a dictionary into a Series:

composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}

pd_composer_birth = pd.Series(composer_birth)
pd_composer_birth

Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
dtype: int64

pd_composer_birth['Puccini']

1858
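The dictionary keys become the index labels of the Series. As a short sketch of what this allows (the names labels, has_puccini and subset are illustrative), one can list the labels, test whether a label exists, and select several labels at once:

```python
import pandas as pd

composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}
pd_composer_birth = pd.Series(composer_birth)

# The index keeps the dictionary keys, in insertion order
labels = list(pd_composer_birth.index)

# The "in" operator tests membership among the index labels, not the values
has_puccini = 'Puccini' in pd_composer_birth

# A list of labels selects several elements and returns a smaller Series
subset = pd_composer_birth[['Mahler', 'Puccini']]

print(labels)
print(has_puccini)
print(subset)
```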

7.2 Dataframes¶

In most cases, one has to deal with more than just one variable, e.g. one has the birth year and the death year of a list of composers. One might also have different types of information, e.g. in addition to numerical variables (year) one might have string variables like the city of birth. The Pandas structure that allows one to deal with such complex data is called a Dataframe, which can be seen as an aggregation of Series with a common index.

7.2.1 Creating a Dataframe¶

To see how to construct such a Dataframe, let's create some more information about composers:

composer_death = pd.Series({'Mahler': 1911, 'Beethoven': 1827, 'Puccini': 1924, 'Shostakovich': 1975})
composer_city_birth = pd.Series({'Mahler': 'Kaliste', 'Beethoven': 'Bonn', 'Puccini': 'Lucques', 'Shostakovich': 'Saint-Petersburg'})

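Now we can combine multiple series into a Dataframe by specifying a variable name for each series. Note that the series are aligned on their indices (here the composers' names), so all our series should share the same indices; entries missing from one of the series are filled with NaN: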

composers_df = pd.DataFrame({'birth': pd_composer_birth, 'death': composer_death, 'city': composer_city_birth})
composers_df
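To see why a common index matters, here is a small sketch in which the two series deliberately do not cover the same composers. The series birth and death and the table partial_df are illustrative variations on the data above, not part of the original example.

```python
import pandas as pd

birth = pd.Series({'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858})
# 'Shostakovich' has no entry in birth, 'Puccini' has none in death
death = pd.Series({'Mahler': 1911, 'Beethoven': 1827, 'Shostakovich': 1975})

# Rows are matched by index label; missing entries become NaN
partial_df = pd.DataFrame({'birth': birth, 'death': death})

print(partial_df)
print(partial_df.dtypes)  # columns containing NaN are stored as floats
```

The resulting table contains all four composers, with NaN wherever one of the series had no value for that label.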

A more common way of creating a Dataframe is to construct it directly from a dictionary of lists where each element of the dictionary turns into a column:

dict_of_list = {'birth': [1860, 1770, 1858, 1906], 'death':[1911, 1827, 1924, 1975], 
 'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']}

pd.DataFrame(dict_of_list)

However, we have now lost the composers' names. We can restore them by providing, as we did before for the Series, a list of indices:

pd.DataFrame(dict_of_list, index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])
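An alternative, sketched below under the assumption that the names are stored as one more list in the dictionary, is to promote that column to the index afterwards with set_index(), a method not used elsewhere in this chapter. The 'name' key and the variables df and df_named are illustrative.

```python
import pandas as pd

dict_of_list = {
    'name': ['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'],
    'birth': [1860, 1770, 1858, 1906],
    'death': [1911, 1827, 1924, 1975],
    'city': ['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg'],
}

df = pd.DataFrame(dict_of_list)    # default integer index 0, 1, 2, 3
df_named = df.set_index('name')    # 'name' becomes the index and stops being a column

print(df_named)
```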

7.2.2 Accessing values¶

There are multiple ways of accessing values or series of values in a Dataframe. Unlike with Series, simple brackets give access to a column and not to a row, for example:

composers_df['city']

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

returns a Series. Alternatively, one can also use the attribute syntax and access columns by using:

composers_df.city

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

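The attribute syntax has some limitations (for example, it does not work for column names that contain spaces or that clash with an existing DataFrame method, and it cannot be used to create a new column), so in case something does not work as expected, revert to the bracket notation.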
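Two situations where the attribute syntax falls short are column names that are not valid Python identifiers and column names that clash with an existing DataFrame method. The sketch below illustrates both on a small made-up table: the column names 'city of birth' and 'count', and the values stored in 'count', are placeholders and not data from this chapter.

```python
import pandas as pd

df = pd.DataFrame(
    {'city of birth': ['Kaliste', 'Bonn'], 'count': [1, 2]},
    index=['Mahler', 'Beethoven'],
)

# A name with spaces is not a valid attribute name,
# so only the bracket notation can reach this column
cities = df['city of birth']

# 'count' is also the name of a DataFrame method: the attribute syntax
# returns the method, while the bracket notation returns the column
attribute_result = df.count
bracket_result = df['count']

print(callable(attribute_result))
print(type(bracket_result))
```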

When specifying multiple columns, a DataFrame is returned:

composers_df[['city', 'birth']]

One of the important differences with a regular Numpy array is that here, regular positional indexing with plain brackets doesn't work. The following line, for example, raises a KeyError:

#composers_df[0,0]

Instead one has to use either the .iloc[] or the .loc[] indexer. .iloc[] can be used to recover regular positional indexing:

composers_df.iloc[0,1]

1911

While .loc[] allows one to recover elements by using the explicit index, in our case the composers' names:

composers_df.loc['Mahler','death']

1911

Remember that .loc and .iloc use brackets [] and not parentheses ().

Numpy-style indexing works here too:

composers_df.iloc[1:3,:]
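The following sketch puts position-based and label-based slicing side by side on the same table. It rebuilds composers_df so that it runs on its own; the names by_position, by_label and mahler_row are illustrative.

```python
import pandas as pd

composers_df = pd.DataFrame(
    {'birth': [1860, 1770, 1858, 1906],
     'death': [1911, 1827, 1924, 1975],
     'city': ['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']},
    index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'],
)

# Position-based slice: rows 1 and 2, the end position 3 is excluded
by_position = composers_df.iloc[1:3, :]

# Label-based slice: both end labels are included, for rows and for columns
by_label = composers_df.loc['Beethoven':'Puccini', 'birth':'death']

# Selecting a single row returns a Series indexed by the column names
mahler_row = composers_df.loc['Mahler']

print(by_position)
print(by_label)
print(mahler_row)
```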

If you are working with a large table, it might be useful to sometimes have a list of all the columns. This is given by the .keys() method, which returns the same result as the .columns attribute:

composers_df.keys()

Index(['birth', 'death', 'city'], dtype='object')
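Along the same lines, a Dataframe exposes a few attributes that help to get an overview of a large table. The sketch below uses .columns, .index and .shape, which are standard Pandas attributes not introduced in the text above; the variable names are illustrative.

```python
import pandas as pd

composers_df = pd.DataFrame(
    {'birth': [1860, 1770, 1858, 1906],
     'death': [1911, 1827, 1924, 1975],
     'city': ['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']},
    index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'],
)

column_names = list(composers_df.columns)   # column labels as a plain list
row_labels = list(composers_df.index)       # row labels as a plain list
n_rows, n_cols = composers_df.shape         # (number of rows, number of columns)

print(column_names)
print(row_labels)
print(n_rows, n_cols)
```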

7.2.3 Adding columns¶

It is very simple to add a column to a Dataframe. One can, for example, just create a column and give it a default value that we can change later:

composers_df['country'] = 'default'

composers_df

Or one can use an existing list:

country = ['Austria','Germany','Italy','Russia']

composers_df['country2'] = country

composers_df

              birth  death              city  country country2
Mahler         1860   1911           Kaliste  default  Austria
Beethoven      1770   1827              Bonn  default  Germany
Puccini        1858   1924           Lucques  default    Italy
Shostakovich   1906   1975  Saint-Petersburg  default   Russia
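Two further operations follow naturally from the above: changing a default value afterwards, and computing a new column from existing ones. The sketch below rebuilds the table so that it runs on its own; the 'age' column is a hypothetical addition, not part of the original example.

```python
import pandas as pd

composers_df = pd.DataFrame(
    {'birth': [1860, 1770, 1858, 1906],
     'death': [1911, 1827, 1924, 1975],
     'city': ['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']},
    index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'],
)

# A single value is repeated on every row
composers_df['country'] = 'default'

# .loc[] can then overwrite one cell, addressed by row and column label
composers_df.loc['Mahler', 'country'] = 'Austria'

# Operations between columns are applied row by row
composers_df['age'] = composers_df['death'] - composers_df['birth']

print(composers_df)
```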