10. Combining information in Pandas¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Often information is comming from different sources and it is necessary to combine it into one object. We are going to see the different ways in which information contained within separate Dataframes can be combined in a meaningful way.

10.1 Concatenation¶

The simplest way we can combine two Dataframes is simply to "paste" them together:

composers1 = pd.read_excel('Data/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1

composers2 = pd.read_excel('Data/composers.xlsx', index_col='composer',sheet_name='Sheet3')
composers2

To be concatenated, Dataframes need to be provided as a list:

all_composers = pd.concat([composers1,composers2])

all_composers

One potential problem is that two tables contain duplicated information:

all_composers.loc['Mahler']

It is very easy to get rid of it using.duplicated() gives us a boolean series of duplications and we can just selected non-duplicated rows:

all_composers.duplicated()

composer
Mahler          False
Beethoven       False
Puccini         False
Shostakovich    False
Verdi           False
Dvorak          False
Schumann        False
Stravinsky      False
Mahler           True
dtype: bool

all_composers[~all_composers.duplicated()]

10.2 Joining two tables¶

An other classical case is that of two list with similar index but containing different information, e.g.

composers1 = pd.read_excel('Data/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1

composers2 = pd.read_excel('Data/composers.xlsx', index_col='composer',sheet_name='Sheet4')
composers2

If we we use again simple concatenation, this doesn't help us much. We just end up with a large matrix with lots of NaN's:

pd.concat([composers1, composers2])

The better way of doing this is to join the tables. This is a classical database concept available in Pandas.

join() operates on two tables: the first one is the "left" table which uses join() as a method. The other table is the "right" one.

Let's try the default join settings:

composers1

composers2

composers1.join(composers2)

We see that Pandas was smart enough to notice that the two tables had a index name and used it to combine the tables. We also see that one element from the second table (Brahms) is missing. The reason for this is the way indices not present in both tables are handled. There are four ways of doing this with two tables called here the "left" and "right" table.

10.2.1. Join left¶

Here "left" and "right" just represent two Dataframes that should be merged. They have a common index, but not necessarily the same items. For example here Shostakovich is missing in the second table, while Brahms is missing in the first one. When using the "right" join, we use the first Dataframe as basis and only use the indices that appear there.

composers1.join(composers2, how = 'left')

Hence Brahms is left out.

10.2.2. Join right¶

We can do the the opposite and use the indices of the second Dataframe as basis:

composers1.join(composers2, how = 'right')

Here we have Brahms but not Shostakovich.

10.2.3. Inner, outer¶

Finally, we can just say that we want to recover eihter only the items that appaer in both Dataframes (inner, like in a Venn diagram) or all the items (outer).

composers1.join(composers2, how = 'inner')

composers1.join(composers2, how = 'outer')

10.3.4 Joining on columns : merge¶

Above we have used join to join based on indices. However sometimes tables don't have the same indices but similar contents that we want to merge. For example let's imagine whe have the two Dataframes below:

composers1 = pd.read_excel('Data/composers.xlsx', sheet_name='Sheet1')
composers2 = pd.read_excel('Data/composers.xlsx', sheet_name='Sheet6')

composers1

composers2

The indices don't match and are not the composer name. In addition the columns containing the composer names have different labels. Here we can use merge()and specify which columns we want to use for merging, and what type of merging we need (inner, left etc.)

pd.merge(composers1, composers2, left_on='composer', right_on='last name')

Again we can use another variety of join than the default inner:

pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'outer')

pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'right')

	birth	death	city
composer
Mahler	1860	1911	Kaliste
Beethoven	1770	1827	Bonn
Puccini	1858	1924	Lucques
Shostakovich	1906	1975	Saint-Petersburg

	birth	death	city
composer
Verdi	1813	1901	Roncole
Dvorak	1841	1904	Nelahozeves
Schumann	1810	1856	Zwickau
Stravinsky	1882	1971	Oranienbaum
Mahler	1860	1911	Kaliste

	birth	death	city
composer
Mahler	1860	1911	Kaliste
Beethoven	1770	1827	Bonn
Puccini	1858	1924	Lucques
Shostakovich	1906	1975	Saint-Petersburg
Verdi	1813	1901	Roncole
Dvorak	1841	1904	Nelahozeves
Schumann	1810	1856	Zwickau
Stravinsky	1882	1971	Oranienbaum
Mahler	1860	1911	Kaliste

	birth	death	city
composer
Mahler	1860	1911	Kaliste
Mahler	1860	1911	Kaliste

	birth	death	city
composer
Mahler	1860	1911	Kaliste
Beethoven	1770	1827	Bonn
Puccini	1858	1924	Lucques
Shostakovich	1906	1975	Saint-Petersburg
Verdi	1813	1901	Roncole
Dvorak	1841	1904	Nelahozeves
Schumann	1810	1856	Zwickau
Stravinsky	1882	1971	Oranienbaum

	first name
composer
Mahler	Gustav
Beethoven	Ludwig van
Puccini	Giacomo
Brahms	Johannes

	birth	death	city	first name
composer
Mahler	1860.0	1911.0	Kaliste	NaN
Beethoven	1770.0	1827.0	Bonn	NaN
Puccini	1858.0	1924.0	Lucques	NaN
Shostakovich	1906.0	1975.0	Saint-Petersburg	NaN
Mahler	NaN	NaN	NaN	Gustav
Beethoven	NaN	NaN	NaN	Ludwig van
Puccini	NaN	NaN	NaN	Giacomo
Brahms	NaN	NaN	NaN	Johannes