11. Splitting data¶

Often one has tables that mix regular variables (e.g. the size of cells in microscopy images) with categorical variables (e.g. the type of cell to which they belong). In that case, it is quite usual to split the data by categories or groups to do computations. Pandas allows to do this very easily.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

11.1 Grouping¶

Let's import some data and have a look at them:

composers = pd.read_excel('Data/composers.xlsx', sheet_name='Sheet5')

composers.head()

We also add a column here to calculate the composers' age:

composers['age'] = composers.death - composers.birth

11.1.1 Single level¶

What if we want now to count how many composers we have in a certain category like the period or country? In classical computing we would maybe do a for loop to count occurrences. Pandas simplifies this with the groupby() function, which actually groups elements by a certain criteria, e.g. a categorical variable like the period:

composer_grouped = composers.groupby('period')
composer_grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d2fc850>

The output is a bit cryptic. What we actually have is a new object called a group which has a lot of handy properties. First let's see what the groups actually are. We can find all groups with groups:

composer_grouped.groups

{'baroque': Int64Index([14, 16, 17, 20, 21, 28, 29, 30, 31, 47], dtype='int64'),
 'classic': Int64Index([9, 10, 32, 40, 51], dtype='int64'),
 'modern': Int64Index([3, 7, 11, 12, 19, 25, 45, 46, 50, 53, 54, 55, 56], dtype='int64'),
 'post-romantic': Int64Index([0, 2, 8, 18, 49], dtype='int64'),
 'renaissance': Int64Index([13, 26, 27, 36, 37, 43, 44], dtype='int64'),
 'romantic': Int64Index([1, 4, 5, 6, 15, 22, 23, 24, 33, 34, 35, 38, 39, 41, 42, 48, 52], dtype='int64')}

We have a dictionary, where each period that appears in the Dataframe is a key and each key contains a list of dataframe indices of rows with those periods. We will rarely directly use those indices, as most operations on groups only use those "behind the scene".

For example we can use describe() on a group object, just like we did it before for a Dataframe:

composer_grouped.describe()#.loc['Austria','birth']

We see here that the statistical analysis has been done for each group, the index of each row being the group name (or key in the dictionary). If we are interested in a specific group we can also easily recover it:

composer_grouped.get_group('classic')

We see that this returns a sub-group from the original table. Effectively it is almost equivalent to:

composers[composers.period == 'classic']

11.1.2 Multi-level¶

If one has multiple categorical variables, one can also do a grouping on several levels. For example here we want to classify composers both by period and country. For this we just give two column names to the groupby() function:

composer_grouped = composers.groupby(['period','country'])
composer_grouped.describe()

composer_grouped.get_group(('baroque','Germany'))

11.2 Operations on groups¶

The main advantage of this Group object is that it allows us to do very quickly both computations and plotting without having to loop through different categories. Indeed Pandas makes all the work for us: it applies functions on each group and then reassembles the results into a Dataframe (or Series depending on the operation).

For example we can apply most functions we used for Dataframes (mean, sum etc.) on groups as well and Pandas seamlessly does the work for us:

composer_grouped.mean()

composer_grouped.count()

We can also design specific functions (again, like in the case of Dataframes) and apply them on groups:

def mult(myseries):
    return myseries.max() * 3

composer_grouped.apply(mult)

11.3 Reshaping dataframes¶

As we see above, grouping operations can create more or less complex dataframes by adding one or multiple indexing levels. There are multiple ways to "reshape" such dataframes in order to make thm usable e.g. for plotting. Typically, plotting software based on a grammar of graphics expect a simple 2D dataframe where each line is an observation with several properties.

11.3.1 re-indexing, unstacking¶

One of the most common "reshaping" is to reset the index. In its simplest form, it will create a new dataframe, where each row corresponds to one observation. For example in the case of a dataframe with multi-indices, it will re-cast these indices as columns:

composer_grouped = composers.groupby(['period','country']).mean()
composer_grouped.head(10)

composer_grouped.reset_index().head(5)

One can of course be more specific and reset only specific indices e.g. by level:

composer_grouped.reset_index(level=1).head(5)

11.3.2 unstacking¶

Another way to move indices to columns is to unstack a dataframe, in other words pivot some indices to columns:

composer_grouped.unstack()

This creates a multi-level column indexing.

11.3.3 Wide to long: melt¶

A very common operation when handling tables is to switch from wide to long format and vice versa. In our composer example, let's for example imagine that you want both birth and death dates to be grouped in a single column called dates. But you still need to know if that data is a birth or date, so you need a new column that indicates that. To achieve that, we need to specify id_vars a list of columns to be used as identifiers e.g. the composer name, and value_vars, a list of columns that should become rows:

composers.head(5)

pd.melt(composers, id_vars=['composer'], value_vars=['birth', 'death'])

We can keep more of the original columns as identifiers and also specify names for the variable and value columns:

melted = pd.melt(composers, id_vars=['composer','period','age','country'], value_vars=['birth', 'death'],
                var_name = 'date_type', value_name='dates')
melted

11.4 Plotting¶

We have seen above that we can create groups and apply functions to them to get some summary of them as new dataframes or series that could then also be reshaped. The final result of these operations is then ideally suited to be plotted in a very efficient way.

Here's a simple example: we group composers by periods and then calculate the mean age, resulting in a series where periods are indices:

composers.groupby('period')['age'].mean()

period
baroque          56.900000
classic          56.800000
modern           77.181818
post-romantic    73.200000
renaissance      68.142857
romantic         58.764706
Name: age, dtype: float64

We can just add one more operation to that line to create a bar plot illustrating this:

composers.groupby('period')['age'].mean().plot(kind = 'bar');

The built-in plotting capabilities of Pandas automatically used the indices to label the bars, and also used the series name as a general label.

Using more advanced libraries, we can go further than that and use multiple columns to create complex plots. This will be shown in the next chapter.

	composer	birth	death	period	country
0	Mahler	1860	1911.0	post-romantic	Austria
1	Beethoven	1770	1827.0	romantic	Germany
2	Puccini	1858	1924.0	post-romantic	Italy
3	Shostakovich	1906	1975.0	modern	Russia
4	Verdi	1813	1901.0	romantic	Italy

	birth								death					age
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
period
baroque	10.0	1663.300000	36.009412	1587.0	1647.0	1676.5	1685.0	1710.0	10.0	1720.200000	...	1755.25	1764.0	10.0	56.900000	18.716897	26.0	40.25	63.0	70.5	81.0
classic	5.0	1744.400000	12.054045	1731.0	1732.0	1749.0	1754.0	1756.0	5.0	1801.200000	...	1806.00	1809.0	5.0	56.800000	16.238842	35.0	52.00	52.0	68.0	77.0
modern	13.0	1905.692308	28.595992	1854.0	1891.0	1902.0	1918.0	1971.0	11.0	1974.090909	...	1990.00	2016.0	11.0	77.181818	13.136348	50.0	70.50	81.0	88.0	91.0
post-romantic	5.0	1854.200000	17.123084	1824.0	1858.0	1860.0	1864.0	1865.0	5.0	1927.400000	...	1949.00	1957.0	5.0	73.200000	16.115210	51.0	66.00	72.0	85.0	92.0
renaissance	7.0	1527.142857	59.881629	1397.0	1528.5	1540.0	1564.5	1567.0	7.0	1595.285714	...	1624.50	1643.0	7.0	68.142857	12.061351	47.0	62.50	69.0	76.5	83.0
romantic	17.0	1824.823529	25.468695	1770.0	1810.0	1824.0	1841.0	1867.0	17.0	1883.588235	...	1904.00	1919.0	17.0	58.764706	13.250139	34.0	49.00	60.0	66.0	88.0

	composer	birth	death	period	country	age
9	Haydn	1732	1809.0	classic	Austria	77.0
10	Mozart	1756	1791.0	classic	Austria	35.0
32	Cimarosa	1749	1801.0	classic	Italy	52.0
40	Soler	1754	1806.0	classic	Spain	52.0
51	Dusek	1731	1799.0	classic	Czechia	68.0

	composer	birth	death	period	country	age
9	Haydn	1732	1809.0	classic	Austria	77.0
10	Mozart	1756	1791.0	classic	Austria	35.0
32	Cimarosa	1749	1801.0	classic	Italy	52.0
40	Soler	1754	1806.0	classic	Spain	52.0
51	Dusek	1731	1799.0	classic	Czechia	68.0

		birth								death					age
		count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
period	country
baroque	England	1.0	1659.000000	NaN	1659.0	1659.00	1659.0	1659.00	1659.0	1.0	1695.000000	...	1695.00	1695.0	1.0	36.000000	NaN	36.0	36.00	36.0	36.00	36.0
	France	3.0	1650.666667	29.263174	1626.0	1634.50	1643.0	1663.00	1683.0	3.0	1709.666667	...	1734.00	1764.0	3.0	59.000000	23.065125	35.0	48.00	61.0	71.00	81.0
	Germany	2.0	1685.000000	0.000000	1685.0	1685.00	1685.0	1685.00	1685.0	2.0	1754.500000	...	1756.75	1759.0	2.0	69.500000	6.363961	65.0	67.25	69.5	71.75	74.0
	Italy	4.0	1663.000000	53.285395	1587.0	1649.25	1677.5	1691.25	1710.0	4.0	1717.250000	...	1741.25	1757.0	4.0	54.250000	20.434856	26.0	46.25	59.5	67.50	72.0
classic	Austria	2.0	1744.000000	16.970563	1732.0	1738.00	1744.0	1750.00	1756.0	2.0	1800.000000	...	1804.50	1809.0	2.0	56.000000	29.698485	35.0	45.50	56.0	66.50	77.0
	Czechia	1.0	1731.000000	NaN	1731.0	1731.00	1731.0	1731.00	1731.0	1.0	1799.000000	...	1799.00	1799.0	1.0	68.000000	NaN	68.0	68.00	68.0	68.00	68.0
	Italy	1.0	1749.000000	NaN	1749.0	1749.00	1749.0	1749.00	1749.0	1.0	1801.000000	...	1801.00	1801.0	1.0	52.000000	NaN	52.0	52.00	52.0	52.00	52.0
	Spain	1.0	1754.000000	NaN	1754.0	1754.00	1754.0	1754.00	1754.0	1.0	1806.000000	...	1806.00	1806.0	1.0	52.000000	NaN	52.0	52.00	52.0	52.00	52.0
modern	Austria	1.0	1885.000000	NaN	1885.0	1885.00	1885.0	1885.00	1885.0	1.0	1935.000000	...	1935.00	1935.0	1.0	50.000000	NaN	50.0	50.00	50.0	50.00	50.0
	Czechia	1.0	1854.000000	NaN	1854.0	1854.00	1854.0	1854.00	1854.0	1.0	1928.000000	...	1928.00	1928.0	1.0	74.000000	NaN	74.0	74.00	74.0	74.00	74.0
	England	2.0	1936.500000	48.790368	1902.0	1919.25	1936.5	1953.75	1971.0	1.0	1983.000000	...	1983.00	1983.0	1.0	81.000000	NaN	81.0	81.00	81.0	81.00	81.0
	France	2.0	1916.500000	12.020815	1908.0	1912.25	1916.5	1920.75	1925.0	2.0	2004.000000	...	2010.00	2016.0	2.0	87.500000	4.949747	84.0	85.75	87.5	89.25	91.0
	Germany	1.0	1895.000000	NaN	1895.0	1895.00	1895.0	1895.00	1895.0	1.0	1982.000000	...	1982.00	1982.0	1.0	87.000000	NaN	87.0	87.00	87.0	87.00	87.0
	RUssia	1.0	1891.000000	NaN	1891.0	1891.00	1891.0	1891.00	1891.0	1.0	1953.000000	...	1953.00	1953.0	1.0	62.000000	NaN	62.0	62.00	62.0	62.00	62.0
	Russia	2.0	1894.000000	16.970563	1882.0	1888.00	1894.0	1900.00	1906.0	2.0	1973.000000	...	1974.00	1975.0	2.0	79.000000	14.142136	69.0	74.00	79.0	84.00	89.0
	USA	3.0	1918.333333	18.502252	1900.0	1909.00	1918.0	1927.50	1937.0	2.0	1990.000000	...	1990.00	1990.0	2.0	81.000000	12.727922	72.0	76.50	81.0	85.50	90.0
post-romantic	Austria	2.0	1842.000000	25.455844	1824.0	1833.00	1842.0	1851.00	1860.0	2.0	1903.500000	...	1907.25	1911.0	2.0	61.500000	14.849242	51.0	56.25	61.5	66.75	72.0
	Finland	1.0	1865.000000	NaN	1865.0	1865.00	1865.0	1865.00	1865.0	1.0	1957.000000	...	1957.00	1957.0	1.0	92.000000	NaN	92.0	92.00	92.0	92.00	92.0
	Germany	1.0	1864.000000	NaN	1864.0	1864.00	1864.0	1864.00	1864.0	1.0	1949.000000	...	1949.00	1949.0	1.0	85.000000	NaN	85.0	85.00	85.0	85.00	85.0
	Italy	1.0	1858.000000	NaN	1858.0	1858.00	1858.0	1858.00	1858.0	1.0	1924.000000	...	1924.00	1924.0	1.0	66.000000	NaN	66.0	66.00	66.0	66.00	66.0
renaissance	Belgium	2.0	1464.500000	95.459415	1397.0	1430.75	1464.5	1498.25	1532.0	2.0	1534.000000	...	1564.00	1594.0	2.0	69.500000	10.606602	62.0	65.75	69.5	73.25	77.0
	England	2.0	1551.500000	16.263456	1540.0	1545.75	1551.5	1557.25	1563.0	2.0	1624.500000	...	1625.25	1626.0	2.0	73.000000	14.142136	63.0	68.00	73.0	78.00	83.0
	Italy	3.0	1552.666667	23.965253	1525.0	1545.50	1566.0	1566.50	1567.0	3.0	1616.666667	...	1628.00	1643.0	3.0	64.000000	15.132746	47.0	58.00	69.0	72.50	76.0
romantic	Czechia	2.0	1832.500000	12.020815	1824.0	1828.25	1832.5	1836.75	1841.0	2.0	1894.000000	...	1899.00	1904.0	2.0	61.500000	2.121320	60.0	60.75	61.5	62.25	63.0
	France	3.0	1821.000000	19.672316	1803.0	1810.50	1818.0	1830.00	1842.0	3.0	1891.333333	...	1902.50	1912.0	3.0	70.333333	4.509250	66.0	68.00	70.0	72.50	75.0
	Germany	4.0	1806.500000	26.388129	1770.0	1800.00	1811.5	1818.00	1833.0	4.0	1865.750000	...	1886.50	1897.0	4.0	59.250000	10.307764	46.0	54.25	60.5	65.50	70.0
	Italy	4.0	1817.250000	28.004464	1797.0	1800.00	1807.0	1824.25	1858.0	4.0	1875.750000	...	1905.50	1919.0	4.0	58.500000	22.605309	34.0	46.75	56.0	67.75	88.0
	Russia	2.0	1836.000000	4.242641	1833.0	1834.50	1836.0	1837.50	1839.0	2.0	1884.000000	...	1885.50	1887.0	2.0	48.000000	8.485281	42.0	45.00	48.0	51.00	54.0
	Spain	2.0	1863.500000	4.949747	1860.0	1861.75	1863.5	1865.25	1867.0	2.0	1912.500000	...	1914.25	1916.0	2.0	49.000000	0.000000	49.0	49.00	49.0	49.00	49.0

	composer	birth	death	period	country	age
14	Haendel	1685	1759.0	baroque	Germany	74.0
47	Bach	1685	1750.0	baroque	Germany	65.0

		composer	birth	death	period	country	age
period	country
baroque	England	PurcellPurcellPurcell	4977	5085.0	baroquebaroquebaroque	EnglandEnglandEngland	108.0
	France	RameauRameauRameau	5049	5292.0	baroquebaroquebaroque	FranceFranceFrance	243.0
	Germany	HaendelHaendelHaendel	5055	5277.0	baroquebaroquebaroque	GermanyGermanyGermany	222.0
	Italy	ScarlattiScarlattiScarlatti	5130	5271.0	baroquebaroquebaroque	ItalyItalyItaly	216.0
classic	Austria	MozartMozartMozart	5268	5427.0	classicclassicclassic	AustriaAustriaAustria	231.0
	Czechia	DusekDusekDusek	5193	5397.0	classicclassicclassic	CzechiaCzechiaCzechia	204.0
	Italy	CimarosaCimarosaCimarosa	5247	5403.0	classicclassicclassic	ItalyItalyItaly	156.0
	Spain	SolerSolerSoler	5262	5418.0	classicclassicclassic	SpainSpainSpain	156.0
modern	Austria	BergBergBerg	5655	5805.0	modernmodernmodern	AustriaAustriaAustria	150.0
	Czechia	JanacekJanacekJanacek	5562	5784.0	modernmodernmodern	CzechiaCzechiaCzechia	222.0
	England	WaltonWaltonWalton	5913	5949.0	modernmodernmodern	EnglandEnglandEngland	243.0
	France	MessiaenMessiaenMessiaen	5775	6048.0	modernmodernmodern	FranceFranceFrance	273.0
	Germany	OrffOrffOrff	5685	5946.0	modernmodernmodern	GermanyGermanyGermany	261.0
	RUssia	ProkofievProkofievProkofiev	5673	5859.0	modernmodernmodern	RUssiaRUssiaRUssia	186.0
	Russia	StravinskyStravinskyStravinsky	5718	5925.0	modernmodernmodern	RussiaRussiaRussia	267.0
	USA	GlassGlassGlass	5811	5970.0	modernmodernmodern	USAUSAUSA	270.0
post-romantic	Austria	MahlerMahlerMahler	5580	5733.0	post-romanticpost-romanticpost-romantic	AustriaAustriaAustria	216.0
	Finland	SibeliusSibeliusSibelius	5595	5871.0	post-romanticpost-romanticpost-romantic	FinlandFinlandFinland	276.0
	Germany	StraussStraussStrauss	5592	5847.0	post-romanticpost-romanticpost-romantic	GermanyGermanyGermany	255.0
	Italy	PucciniPucciniPuccini	5574	5772.0	post-romanticpost-romanticpost-romantic	ItalyItalyItaly	198.0
renaissance	Belgium	LassusLassusLassus	4596	4782.0	renaissancerenaissancerenaissance	BelgiumBelgiumBelgium	231.0
	England	DowlandDowlandDowland	4689	4878.0	renaissancerenaissancerenaissance	EnglandEnglandEngland	249.0
	Italy	PalestrinaPalestrinaPalestrina	4701	4929.0	renaissancerenaissancerenaissance	ItalyItalyItaly	228.0
romantic	Czechia	SmetanaSmetanaSmetana	5523	5712.0	romanticromanticromantic	CzechiaCzechiaCzechia	189.0
	France	MassenetMassenetMassenet	5526	5736.0	romanticromanticromantic	FranceFranceFrance	225.0
	Germany	WagnerWagnerWagner	5499	5691.0	romanticromanticromantic	GermanyGermanyGermany	210.0
	Italy	VerdiVerdiVerdi	5574	5757.0	romanticromanticromantic	ItalyItalyItaly	264.0
	Russia	MussorsgskyMussorsgskyMussorsgsky	5517	5661.0	romanticromanticromantic	RussiaRussiaRussia	162.0
	Spain	GranadosGranadosGranados	5601	5748.0	romanticromanticromantic	SpainSpainSpain	147.0