import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Often information is comming from different sources and it is necessary to combine it into one object. We are going to see the different ways in which information contained within separate Dataframes can be combined in a meaningful way.
The simplest way we can combine two Dataframes is simply to "paste" them together:
composers1 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1
composers2 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet3')
composers2
To be concatenated, Dataframes need to be provided as a list:
all_composers = pd.concat([composers1,composers2])
# columns have the same -> ok
all_composers
One potential problem is that two tables contain duplicated information:
# be careful if same index !! can have twice the same index !
# ! ensure to not have the same index
all_composers.loc['Mahler']
It is very easy to get rid of it using:
all_composers.drop_duplicates()
# suppress duplicates from the table
An other classical case is that of two list with similar index but containing different information, e.g.
composers1 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1
composers2 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet4')
composers2
If we we use again simple concatenation, this doesn't help us much. We just end up with a large matrix with lots of NaN's:
pd.concat([composers1, composers2])
The better way of doing this is to join the tables. This is a classical database concept avaialble in Pandas.
join()
operates on two tables: the first one is the "left" table which uses join()
as a method. The other table is the "right" one.
Let's try the default join settings:
composers1.join(composers2)
# get all elements of the 1st table, merged with the 2nd table
# everything based on the left table (what from the 2nd table and is not in the 1st table is dropped)
# by default is left based
We see that Pandas was smart enough to notice that the two tables had a index name and used it to combine the tables. We also see that one element from the second table (Brahms) is missing. The reason for this is the way indices not present in both tables are handled. There are four ways of doing this with two tables called here the "left" and "right" table.
Here "left" and "right" just represent two Dataframes that should be merged. They have a common index, but not necessarily the same items. For example here Shostakovich is missing in the second table, while Brahms is missing in the first one. When using the "right" join, we use the first Dataframe as basis and only use the indices that appear there.
composers1.join(composers2, how = 'left') # this is the default
Hence Brahms is left out.
We can do the the opposite and use the indices of the second Dataframe as basis:
composers1.join(composers2, how = 'right')
Here we have Brahms but not Shostakovich.
Finally, we can just say that we want to recover eihter only the items that appaer in both Dataframes (inner, like in a Venn diagram) or all the items (outer).
composers1.join(composers2, how = 'inner')
# => take all indices of both tables
composers1.join(composers2, how = 'outer')
Above we have used join
to join based on indices. However sometimes tables don't have the same indices but similar contents that we want to merge. For example let's imagine whe have the two Dataframes below:
composers1 = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet1')
composers2 = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet6')
composers1
composers2
The indices don't match and are not the composer name. In addition the columns containing the composer names have different labels. Here we can use merge()
and specify which columns we want to use for merging, and what type of merging we need (inner, left etc.)
# take left and right tables and can specify which column from each to perform the merge
pd.merge(composers1, composers2, left_on='composer', right_on='last name')
Again we can use another variety of join than the default inner:
pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'outer')
pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'right')
# MZ: to remove the column that now contains duplicated information:
dt1 = pd.merge(composers1, composers2, left_on='composer', right_on='last name')
dt1
# default is to drop rows, to drop columns set 1st axis
dt1.drop('last name', axis=1)
dt1
dt1.drop('last name', axis=1, inplace=True)
dt1