0. Pandas introduction¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from plotnine import ggplot, aes
import plotnine as pn

Why should we use Pandas instad of just opening our table in Excel? While Excel is practical to browse through data, it is very cumbersome to use it to combine, re-arrange and thoroughly analyze data: code is hidden and difficult to share, there's no version control, it's difficult to automate tasks, the manual clicking around leads to mistakes etc.

In this course you will learn how to handle tabular data with Pandas, a Python package widely used in the scientific and data science areas. You will learn how to create and import tables, how to combine them, modify them, do statistical analysis on them and finally how to use them to easily create complex visualisations.

So that you see where this leads, we start with a short example of how Pandas can be used in a project. We look here at tables provided by the Federal Office of Statistics about all Swiss towns every year (population, number of vacant houses, unemployement etc.)

0.1 From import to plot in 3 lines¶

Import: In a first step we import the spread sheet:

towns = pd.read_excel('Datasets/2013.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*'])

Then we can look at it:

towns.head()

And in one line do a rather complex plot where cities are split into size groups and their average vote calcualted:

ax = towns.groupby(pd.cut(towns['Habitants'], np.arange(0, 100000, 10000)),as_index=False).mean().dropna().plot(x = 'Habitants', y = ['UDC','PSS'])
ax.set_ylabel('% Vote')
plt.show()

0.2 Combining data and assemble complex plot¶

Above we imported a single table. However often you need to combine information from multiple sources or multiple experiments. This can be extremely tedious to do in Excel. Here it is done in a few lines:

import all tables in a loop

towns = []
for x in range(2013, 2015):
    temp_town = pd.read_excel('Datasets/'+str(x)+'.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*','+'])
    temp_town['year'] = x
    towns.append(temp_town)

assemble the tables (concatenation)

towns_concat = pd.concat(towns,sort=False)

Make a list of cities you are interested in and select only those:

cities = ['Zürich', 'Bern','Lausanne','Basel','Genève']

Plot for each city a parameter you are interested in like the fraction of vacant appartments:

import seaborn as sns

g = sns.lineplot(data = towns_concat.loc[cities].reset_index(), x = 'year', y = 'Taux de logements vacants',
            hue = 'Commune', marker = "o", dashes=False);
plt.legend(loc='upper left')
plt.show()

We can also exploit the plotting capabilities of advanced packages such as plotnine, a Python version of ggplot, to create complex plots with little effort. For example here, we show how the voting depends on how much a town depends on agriculture. We separate the data by year as well as py party.

First we just import two years of data (two different parliaments):

towns = []
for x in [2014, 2018]:
    temp_town = pd.read_excel('Datasets/'+str(x)+'.xls', skiprows=list(range(5))+list(range(6,9)), skipfooter=34, index_col='Commune',na_values=['*','+'])
    temp_town['year'] = x
    towns.append(temp_town)
towns_concat = pd.concat(towns,sort=False)

We recover the necessary information and do some data reshaping (tidying) to be able to easily realize the plot:

towns_parties = towns_concat.reset_index()[['year','UDC','PS','Commune','Surface agricole en %']]

#wide to long
towns_parties = pd.melt(towns_parties, id_vars=['Commune','Surface agricole en %','year'],
                        value_vars=["UDC", "PS"], value_name='Vote fraction', var_name='Party')

towns_parties.head()

And finally we can plot our data:

sns.lmplot(data = towns_parties.dropna(), x = 'Surface agricole en %',
          y = 'Vote fraction', hue = 'year',col = 'Party',
           scatter_kws={'alpha' :0.1})

<seaborn.axisgrid.FacetGrid at 0x7f3a68867e10>

(ggplot(towns_parties.dropna(),aes(x = 'Surface agricole en %', y = 'Vote fraction', color = 'factor(year)', shape = 'Party'))
 + pn.geom_point(alpha = 0.05)
 + pn.geom_smooth(method='lm')
 + pn.stats.stat_summary_bin(fun_data = 'mean_cl_normal', bins = 30)#fun_y = np.mean, fun_ymin=np.var, fun_ymax=np.var, bins = 30)
 + pn.labs(color = 'Year')
 + pn.scale_color_manual(['red','blue'])
).draw();

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py:2223: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py:2223: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py:2223: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
/usr/local/lib/python3.5/dist-packages/numpy/core/fromnumeric.py:2223: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

1. Pandas objects¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

If you have already used Python, you know about its standard data structures (list, dicts etc). If you use Python for science, you also probably know Numpy arrays which underlying almost all other specialized scientific packages.

None of these structures offers a simple way to handle database style data, nor to easily do standard database operations. This is why Pandas exists: it offers a complete ecosystem of structures and functions dedicated to handle large tables with inhomogeneous contents.

In this first chapter, we are going to learn about the two main structures of Pandsa: Series and Dataframes.

1.1 Series¶

1.1.1 Simple series¶

Series are a the Pandas version of 1-D Numpy arrays. To understand their specificities, let's create one. Usually Pandas structures (Series and Dataframes) are creates from other simpler structures like Numpy arrays or dictionaries:

(MZ: Pandas builds on numpy; Series are equivalent of a list)

numpy_array = np.array([4,8,38,1,6])

# MZ:
numpy_array**2

array([  16,   64, 1444,    1,   36])

The function pd.Series() allows us to convert objects into Series:

pd_series = pd.Series(numpy_array)
pd_series
# on the left => the indices; can be anything, e.g. names of towns

0     4
1     8
2    38
3     1
4     6
dtype: int64

The underlying structure can be recovered with the .values attribute:

pd_series.values
# MZ: output is numpy array

array([ 4,  8, 38,  1,  6])

Otherwise, indexing works as for regular arrays:

pd_series[1]

8

1.1.2 Indexing¶

On top of accessing values in a series by regular indexing, one can create custom indices for each element in the series:

pd_series2 = pd.Series(numpy_array, index=['a', 'b', 'c', 'd','e'])
# MZ: force the index to be what we give
# MZ: to retrieve the indexes (keys)
pd_series2.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

pd_series2

a     4
b     8
c    38
d     1
e     6
dtype: int64

Now a given element can be accessed either by using its regular index:

pd_series2[1]

8

or its chosen index:

pd_series2['b']

8

A more direct way to create specific indexes is to transform as dictionary into a Series:

# MZ: use dict to build the Series
composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}

pd_composer_birth = pd.Series(composer_birth)
pd_composer_birth

Beethoven       1770
Mahler          1860
Puccini         1858
Shostakovich    1906
dtype: int64

pd_composer_birth['Puccini']

1858

1.2 Dataframes¶

In most cases, one has to deal with more than just one variable, e.g. one has the birth year and the death year of a list of composers. Also one might have different types of information, e.g. in addition to numerical variables (year) one might have string variables like the city of birth. The Pandas structure that allow one to deal with such complex data is called a Dataframe, which can somehow be seen as an aggregation of Series with a common index.

1.2.1 Creating a Dataframe¶

To see how to construct such a Dataframe, let's create some more information about composers:

composer_death = pd.Series({'Mahler': 1911, 'Beethoven': 1827, 'Puccini': 1924, 'Shostakovich': 1975})
composer_city_birth = pd.Series({'Mahler': 'Kaliste', 'Beethoven': 'Bonn', 'Puccini': 'Lucques', 'Shostakovich': 'Saint-Petersburg'})

Now we can combine multiple series into a Dataframe by precising a variable name for each series. Note that all our series need to have the same indices (here the composers' name):

# MZ: put Series together into a Dataframe
composers_df = pd.DataFrame({'birth': pd_composer_birth, 'death': composer_death, 'city': composer_city_birth})
composers_df

A more common way of creating a Dataframe is to construct it directly from a dictionary of lists:

dict_of_list = {'birth': [1860, 1770, 1858, 1906], 'death':[1911, 1827, 1924, 1975], 
 'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']}

pd.DataFrame(dict_of_list)
# MZ: default indexes from 0 to 3

However we now lost the composers name. We can enforce it by providing, as we did before for the Series, a list of indices:

pd.DataFrame(dict_of_list, index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])
# MZ: you can explicitly pass the index

1.2.2 Accessing values¶

There are multiple ways of accessing values or series of values in a Dataframe. Unlike in Series, a simple bracket gives access to a column and not an index, for example:

composers_df['city']

Beethoven                   Bonn
Mahler                   Kaliste
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

returns a Series. Alternatively one can also use the attributes synthax and access columns by using:

composers_df.city
# rather recommended to use the brackets

Beethoven                   Bonn
Mahler                   Kaliste
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

The attributes synthax has some limitations, so in case something does not work as expected, revert to the brackets notation.

When specifiying multiple columns, a DataFrame is returned:

composers_df[['city', 'birth']]

One of the important differences with a regular Numpy array is that here, regular indexing doesn't work:

#composers_df[0,0]

Instead one has to use either the .iloc[] or the .loc[] method. .ìloc[] can be used to recover the regular indexing:

# MZ: recover by positions 
composers_df.iloc[0,1]

'Bonn'

While .loc[] allows one to recover elements by using the explicit index, on our case the composers name:

# MZ: recover by indices
composers_df.loc['Mahler','death']

1911

Remember that loc and ``ìloc``` use brackets [] and not parenthesis ().

Numpy style indexing works here too

composers_df.iloc[1:3,:]

If you are working with a large table, it might be useful to sometimes have a list of all the columns. This is given by the .keys() attribute:

composers_df.keys()

Index(['birth', 'city', 'death'], dtype='object')

1.2.3 Adding columns¶

It is very simple to add a column to a Dataframe. One can e.g. just create a column a give it a default value that we can change later:

# MZ: if pass a single value, add the same value everywhere
composers_df['country'] = 'default'

composers_df

Or one can use an existing list:

country = ['Austria','Germany','Italy','Russia']

composers_df['country2'] = country

composers_df

2. Importing excel files¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

We have seen in the previous chapter what structures are offered by Pandas and how to create them. Another very common way of "creating" a Pandas Dataframe is by importing a table from another format like CSV or Excel.

2.1 Simple import¶

An Excel table containing the same information as we had in Chapter 1 is provided in composers.xlsx and can be read with the read_excel function. There are many more readers for other types of data (csv, json, html etc.) but we focus here on Excel.

pd.read_excel('Datasets/composers.xlsx')

The reader automatically recognized the heaers of the file. However it created a new index. If needed we can specify which column to use as header:

pd.read_excel('Datasets/composers.xlsx', index_col = 'composer')
# MZ: specify which column you want to be the index

If we open the file in Excel, we see that it is composed of more than one sheet. Clearly, when not specifying anything, the reader only reads the first sheet. However we can specify a sheet:

pd.read_excel('Datasets/composers.xlsx', index_col = 'composer',sheet_name='Sheet2')

For each reader, there is a long list of options to specify how the file should be read. We can see all these options using the help (see below). Imagine that our tables contains a title and unnecessary rows: we can use the skiprows argument. Imagine you have dates in your table: you can use the date_parser argument to specify how to format them etc.

#use shift+tab within the parenthesis to see optional arguemnts
#pd.read_excel()

2.2 Handling unknown values¶

As you can see above, some information is missing. Some missing values are marked as "unknown" while other are NaN. NaN is the standard symbol for unknown/missing values and is understood by Pandas while "unknown" is just seen as text. This is impractical as now we have e.g. columns with a mix of numbers and text which will make later computations difficult. What we would like to do is to replace all "irrelevant" values with the standard NaN symbol that says "no information".

For this we can use the na_values argument to specify what should be a NaN. Let's compare the two imports:

import1 = pd.read_excel('Datasets/composers.xlsx', index_col = 'composer',sheet_name='Sheet2')
import1

# MZ: specify which values should be set to NA
import2 = pd.read_excel('Datasets/composers.xlsx', index_col = 'composer',sheet_name='Sheet2', na_values=['unknown'])
import2

If we look now at one column, we can see that columns have been imported in different ways. One column is an object, i.e. mixed types, the other contains floats.

import1.birth

composer
Mahler             1860
Beethoven          1770
Puccini            1858
Shostakovich       1906
Sibelius        unknown
Haydn               NaN
Name: birth, dtype: object

import2.birth

composer
Mahler          1860.0
Beethoven       1770.0
Puccini         1858.0
Shostakovich    1906.0
Sibelius           NaN
Haydn              NaN
Name: birth, dtype: float64

Handling bad or missing values is a very important part of data science. Taking care of the most common occurrences at import is a good solution.

2.3 Column types¶

We see above that the birth column has been "classified" as a float. However we know that this is not the case, it's just an integer. Here again, we can specify the column type already at import time using the dtype option and a dictionary:

# MZ: explicitly specify what types should be
# (e.g. to have faster computations using integers instead of floats)
import2 = pd.read_excel('Datasets/composers.xlsx', index_col = 'composer',sheet_name='Sheet1', na_values=['unknown'],
                       dtype={'composer':np.str,'birth':np.int32,'death':np.int32,'city':np.str})

import2.birth

composer
Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
Name: birth, dtype: int32

2.4 Modifications after import¶

Of course we don't have to do all these adjustement at import time. We can also do a default import and check what has to be corrected afterward.

2.4.1 Create NaNs¶

If we missed some bad values at import we can just replace all those directly in the dataframe. We can achieve that by using the replace() method and specifying what should be replaced:

import1

import_nans = import1.replace('unknown', np.nan) # MZ: what replace, with what to replace as arguments
import_nans.birth

composer
Mahler          1860.0
Beethoven       1770.0
Puccini         1858.0
Shostakovich    1906.0
Sibelius           NaN
Haydn              NaN
Name: birth, dtype: float64

Note that when we fix "bad" values, e.g. here the "unknown" text value with NaNs, Pandas automatically adjust the type of the column, allowing us for exampel to later do mathemtical operations.

import1.birth.dtype

dtype('O')

import_nans.birth.dtype
# MZ: was before mix type -> after fixing the 'unknown' -> automatically update the type

dtype('float64')

2.4.2 Changing the type¶

We can also change the type of a column on an existing Dataframe:

import2.birth

composer
Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
Name: birth, dtype: int32

import2.birth.astype('float')

composer
Mahler          1860.0
Beethoven       1770.0
Puccini         1858.0
Shostakovich    1906.0
Name: birth, dtype: float64

If we look again at import2:

import2.birth

composer
Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
Name: birth, dtype: int32

we see that we didn't actually change the type. Changes on a Dataframe are only effective if we reassign the column:

import2.birth = import2.birth.astype('float')

import2.birth

composer
Mahler          1860.0
Beethoven       1770.0
Puccini         1858.0
Shostakovich    1906.0
Name: birth, dtype: float64

2.5 Plotting¶

We will learn more about plotting later, but let's see here some possibilities offered by Pandas. Pandas builds on top of Matplotlib but exploits the knowledge included in Dataframes to improve the default output. Let's see with a simple dataset.

composers = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet5')

We can pass Series to Matplotlib which manages to understand them. Here's a default scatter plot:

plt.plot(composers.birth, composers.death, 'o')
plt.show()

Now we look at the default Pandas output. Different types of plots are accessible when using the data_frame.plot function via the kind option. The variables to plot are column names passed as keywords instead of whole series like in Matplotlib:

composers.plot(x = 'birth', y = 'death', kind = 'scatter')
# MZ: in this way you directly have the labels
plt.show()

We see that the plot automatically gets axis labels. Another gain is that some obvious options like setting a title are directly accesible when creating the plot:

composers.plot(x = 'birth', y = 'death', kind = 'scatter',
               title = 'Composer birth and death', grid = True, fontsize = 15)
plt.show()

One can add even more information on the plot by using more arguments used in a similar way as a grammar of graphics. For example we can color the scatter plot by periods:

composers.plot(x = 'birth', y = 'death',kind = 'scatter',
               c = composers.period.astype('category').cat.codes, colormap = 'Reds', title = 'Composer birth and death', grid = True, fontsize = 15)
plt.show()

Here you see already a limitation of the plotting library. To color dots by the peiod category, we had to turn the latter into a series of numbers. We could then rename those to improve the plot, but it's better to use more specialized packages such as Seaborn which allow to realize this kind of plot easily:

sns.scatterplot(data = composers, x = 'birth', y = 'death', hue = 'period') # MZ: 'hue' to get the colors with the dots
plt.show()

Some additional plotting options are available in the plot() module. For example histograms:

composers.plot.hist(alpha = 0.5)
plt.show()

Here you see again the gain from using Pandas: without specifying anything, Pandas made a histogram of the two columns containing numbers, labelled the axis and even added a legend to the plot.

All these features are very nice and very helpful when exploring a dataset. When anaylzing data in depth and creating complex plots, Pandas's plotting might however be limiting and other options such as Seaborn or Plotnine can be used.

Finally, all plots can be "styled" down to the smallest detail, either by using Matplotlib options or by directly applying a style e.g.:

plt.style.use('ggplot')

composers.plot.hist(alpha = 0.5)
plt.show()

3. Operations with Pandas objects¶

import pandas as pd
import numpy as np

One of the great advantages of using Pandas to handle tabular data is how simple it is to extract valuable information from them. Here we are going to see various types of operations that are available for this.

3.1 Matrix types of operations¶

The strength of Numpy is its natural way of handling matrix operations, and Pandas reuses a lot of these features. For example one can use simple mathematical operations to opereate at the cell level:

compo_pd = pd.read_excel('Datasets/composers.xlsx')
compo_pd

compo_pd['birth']*2

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

0    3720
1    3540
2    3716
3    3812
Name: birth, dtype: int64

np.log(compo_pd['birth'])

0    7.528332
1    7.478735
2    7.527256
3    7.552762
Name: birth, dtype: float64

Here we applied functions only to series. Indeed, since our Dataframe contains e.g. strings, no operation can be done on it:

#compo_pd+1

If however we have a homogenous Dataframe, this is possible:

compo_pd[['birth','death']]*2

3.2 Column operations¶

There are other types of functions whose purpose is to summarize the data. For example the mean or standard deviation. Pandas by default applies such functions column-wise and returns a series containing e.g. the mean of each column:

np.mean(compo_pd)

birth    1848.50
death    1909.25
dtype: float64

Note that columns for which a mean does not make sense, like the city are discarded. A series of commun functions like mean or standard deviation are directly implemented as methods and can be accessed in the alternative form:

compo_pd.mean()

birth    1848.50
death    1909.25
dtype: float64

If you need the mean of only a single column you can of course chains operations:

compo_pd.birth.mean()

1848.5

3.3 Operations between Series¶

We can also do computations with multiple series as we would do with Numpy arrays:

compo_pd['death']-compo_pd['birth']
# operations between columns

0    51
1    57
2    66
3    69
dtype: int64

We can even use the result of this computation to create a new column in our Dataframe:

compo_pd['age'] = compo_pd['death']-compo_pd['birth']

compo_pd

3.4 Other functions¶

Sometimes one needs to apply to a column a very specific function that is not provided by default. In that case we can use one of the different apply methods of Pandas.

The simplest case is to apply a function to a column, or Series of a DataFrame. Let's say for example that we want to define the the age >60 as 'old' and <60 as 'young'. We can define the following general function:

def define_age(x):
    if x>60:
        return 'old'
    else:
        return 'young'

define_age(30)

'young'

define_age(70)

'old'

We can now apply this function on an entire Series:

compo_pd.age.apply(define_age)
# MZ: apply take variable inputs and return variable outputs
# to apply, you can pass Series, DataFrame; can return DataFrame or single numbers or list of numbers, etc.

0    young
1    young
2      old
3      old
Name: age, dtype: object

# MZ: or using lambda function
compo_pd.age.apply(lambda x: x**2)

0    2601
1    3249
2    4356
3    4761
Name: age, dtype: int64

And again, if we want, we can directly use this output to create a new column:

compo_pd['age_def'] = compo_pd.age.apply(define_age)
compo_pd
# MZ: NB: you can also apply functions to rows of the dataframe
# can be useful to create categorical variables

We can also apply a function to an entire DataFrame. For example we can ask how many composers have birth and death dates within the XIXth century:

def nineteen_century_count(x):
    return np.sum((x>=1800)&(x<1900))
#def nineteen_century_count2(x):
#    return np.sum((x>=1800)and(x<1900)) # does not work !!

5 < 10 and 5 < 6

True

compo_pd[['birth','death']].apply(nineteen_century_count)
#compo_pd[['birth','death']].apply(nineteen_century_count2)

birth    2
death    1
dtype: int64

The function is applied column-wise and returns a single number for each in the form of a series.

def nineteen_century_true(x):
    return (x>=1800)&(x<1900)

compo_pd[['birth','death']].apply(nineteen_century_true)

Here the operation is again applied column-wise but the output is a Series.

There are more combinations of what can be the in- and output of the apply function and in what order (column- or row-wise) they are applied that cannot be covered here.

3.5 Logical indexing¶

Just like with Numpy, it is possible to subselect parts of a Dataframe using logical indexing. Let's have a look again at an example:

compo_pd

If we use a logical comprison on a series, this yields a logical Series:

compo_pd['birth'] > 1859

0     True
1    False
2    False
3     True
Name: birth, dtype: bool

log_indexer = compo_pd['birth'] > 1859
log_indexer
compo_pd[log_indexer]
# MZ: select the rows based on logical indexing
# MZ: to negate the logicals
compo_pd[~log_indexer]
# ! again here not is not working !
# compo_pd[not log_indexer] # ERROR !

Just like in Numpy we can use this logical Series as an index to select elements in the Dataframe:

compo_pd[compo_pd['birth'] > 1859]

We can also create more complex logical indexings:

compo_pd[(compo_pd['birth'] > 1859)&(compo_pd['age']>60)]

And we can create new arrays containing only these subselections:

compos_sub = compo_pd[compo_pd['birth'] > 1859]
# MZ how the 2 are connected ??? tricky to know if Pandas create a copy or not 
# best to explicitly create a copy !

We can then modify the new array:

compos_sub.loc[0,'birth'] = 3000
# warning to tell that something might go wrong (because not used copy)

/usr/local/lib/python3.5/dist-packages/pandas/core/indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Note that we get this SettingWithCopyWarning warning. This is a very common problem hand has to do with how new arrays are created when making subselections. Simply stated, did we create an entirely new array or a "view" of the old one? This will be very case-dependent and to avoid this, if we want to create a new array we can just enforce it using the copy() method (for more information on the topic see for example this explanation:

# MZ: better to explicitly create a copy
compos_sub2 = compo_pd[compo_pd['birth'] > 1859].copy()
compos_sub2.loc[0,'birth'] = 3000

04-Combining information in Pandas¶

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Often information is comming from different sources and it is necessary to combine it into one object. We are going to see the different ways in which information contained within separate Dataframes can be combined in a meaningful way.

4.1 Concatenation¶

The simplest way we can combine two Dataframes is simply to "paste" them together:

composers1 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1

composers2 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet3')
composers2

To be concatenated, Dataframes need to be provided as a list:

all_composers = pd.concat([composers1,composers2])
# columns have the same -> ok

all_composers

One potential problem is that two tables contain duplicated information:

# be careful if same index !! can have twice the same index ! 
# ! ensure to not have the same index
all_composers.loc['Mahler']

It is very easy to get rid of it using:

all_composers.drop_duplicates()
# suppress duplicates from the table

4.2 Joining two tables¶

An other classical case is that of two list with similar index but containing different information, e.g.

composers1 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet1')
composers1

composers2 = pd.read_excel('Datasets/composers.xlsx', index_col='composer',sheet_name='Sheet4')
composers2

If we we use again simple concatenation, this doesn't help us much. We just end up with a large matrix with lots of NaN's:

pd.concat([composers1, composers2])

/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.

The better way of doing this is to join the tables. This is a classical database concept avaialble in Pandas.

join() operates on two tables: the first one is the "left" table which uses join() as a method. The other table is the "right" one.

Let's try the default join settings:

composers1.join(composers2)
# get all elements of the 1st table, merged with the 2nd table
# everything based on the left table (what from the 2nd table and is not in the 1st table is dropped)
# by default is left based

We see that Pandas was smart enough to notice that the two tables had a index name and used it to combine the tables. We also see that one element from the second table (Brahms) is missing. The reason for this is the way indices not present in both tables are handled. There are four ways of doing this with two tables called here the "left" and "right" table.

4.2.1. Join left¶

Here "left" and "right" just represent two Dataframes that should be merged. They have a common index, but not necessarily the same items. For example here Shostakovich is missing in the second table, while Brahms is missing in the first one. When using the "right" join, we use the first Dataframe as basis and only use the indices that appear there.

composers1.join(composers2, how = 'left') # this is the default

Hence Brahms is left out.

4.2.2. Join right¶

We can do the the opposite and use the indices of the second Dataframe as basis:

composers1.join(composers2, how = 'right')

Here we have Brahms but not Shostakovich.

4.2.3. Inner, outer¶

Finally, we can just say that we want to recover eihter only the items that appaer in both Dataframes (inner, like in a Venn diagram) or all the items (outer).

composers1.join(composers2, how = 'inner')
# => take all indices of both tables

composers1.join(composers2, how = 'outer')

4.3.4 Joining on columns : merge¶

Above we have used join to join based on indices. However sometimes tables don't have the same indices but similar contents that we want to merge. For example let's imagine whe have the two Dataframes below:

composers1 = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet1')
composers2 = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet6')

composers1

composers2

The indices don't match and are not the composer name. In addition the columns containing the composer names have different labels. Here we can use merge()and specify which columns we want to use for merging, and what type of merging we need (inner, left etc.)

# take left and right tables and can specify which column from each to perform the merge
pd.merge(composers1, composers2, left_on='composer', right_on='last name')

Again we can use another variety of join than the default inner:

pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'outer')

pd.merge(composers1, composers2, left_on='composer', right_on='last name',how = 'right')

# MZ: to remove the column that now contains duplicated information:
dt1 = pd.merge(composers1, composers2, left_on='composer', right_on='last name')
dt1
# default is to drop rows, to drop columns set 1st axis
dt1.drop('last name', axis=1)
dt1
dt1.drop('last name', axis=1, inplace=True)
dt1

5.Splitting data¶

Often one has tables that mix regular variables (e.g. the size of cells in microscopy images) with categorical variables (e.g. the type of cell to which they belong). In that case, it is quite usual to split the data using the category to do computations. Pandas allows to do this very easily.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

5.1 Grouping¶

Let's import some data and have a look at them

composers = pd.read_excel('Datasets/composers.xlsx', sheet_name='Sheet5')

composers.head()

# MZ
# you don't have to explicitly go through the table and groupe elements
# simply use the 'groupby' function

5.1.1 Single level¶

What if we want now to count how many composers we have in each category? In classical computing we would maybe do a for loop to count occurrences. Pandas simplifies this with the groupby() function, which actually groups elements by a certain criteria, e.g. a categorical variable like the period:

composer_grouped = composers.groupby('period')
composer_grouped
# MZ: create new type of object from Pandas

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5ff3cfbf60>

The output is a bit cryptic. What we actually have is a new object called a group which has a lot of handy properties. First let's see what the groups actually are. As for the Dataframe, let's look at a summary of the object:

composer_grouped.describe()
# MZ: get all the statistics by the groups created

# MZ: for example to see the different levels
composers.country.unique()

array(['Austria', 'Germany', 'Italy', 'Russia', 'Czechia', 'Finland',
       'France', 'RUssia', 'England', 'Belgium', 'Spain', 'USA'],
      dtype=object)

So we have a dataframe with a statistical summary of the the contents. The "names" of the groups are here the indices of the Dataframe. These names are simply all the different categories that were present in the column we used for grouping. Now we can recover a single group:

composer_grouped.get_group('baroque')

composer_grouped.get_group('post-romantic')

5.2.2 Multi-level¶

If one has multiple categorical variables, one can also do a grouping on several levels. For example here we want to classify composers both by period and country. For this we just give two column names to the groupby() function:

# MZ: groupping can be done on multiple columns
composer_grouped = composers.groupby(['period','country'])
composer_grouped.describe()

composer_grouped.get_group(('baroque','Germany'))

5.2 Operations on groups¶

The main advantage of this Group object is that it allows us to do very quickly both computations and plotting without having to loop through different categories. Indeed Pandas makes all the work for us: it applies functions on each group and then reassembles the results into a Dataframe (or Series depending on output).

For example we can apply most functions we used for Dataframes (mean, sum etc.) on groups as well and Pandas seamlessly does the work for us:

composer_grouped.mean()
# MZ: often you can directly apply the functions on the Pandas object

composer_grouped.count()

We can also design specific functions (again, like in the case of Dataframes) and apply them on groups:

def mult(ser):
    return ser.max() * 3

composer_grouped.apply(mult)
# MZ: most functions can be applied irrespectively of the object (DataFrame, group, Series, etc.)

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

5.3 Unstacking¶

Let's have a look again at one of our grouped Dataframe on which we applied some summary function like a mean on the age column:

composers['age']= composers['death']-composers['birth']

composers.groupby(['country','period']).age.mean()

country  period       
Austria  classic          56.000000
         modern           50.000000
         post-romantic    61.500000
Belgium  renaissance      69.500000
Czechia  classic          68.000000
         modern           74.000000
         romantic         61.500000
England  baroque          36.000000
         modern           81.000000
         renaissance      73.000000
Finland  post-romantic    92.000000
France   baroque          59.000000
         modern           87.500000
         romantic         70.333333
Germany  baroque          69.500000
         modern           87.000000
         post-romantic    85.000000
         romantic         59.250000
Italy    baroque          54.250000
         classic          52.000000
         post-romantic    66.000000
         renaissance      64.000000
         romantic         58.500000
RUssia   modern           62.000000
Russia   modern           79.000000
         romantic         48.000000
Spain    classic          52.000000
         romantic         49.000000
USA      modern           81.000000
Name: age, dtype: float64

Here we have two level of indices, with the main one being the country which contains all periods. Often for plotting we however need to have the information in another format. In particular we would like each of these values to be one observation in a regular table. For example we could have a country vs period table where all elements are the mean age. To do that we need to unstack our multi-level Dataframe:

# MZ: to obtain regular 2dim object
composer_unstacked = composers.groupby(['country','period']).age.mean().unstack()

composer_unstacked

5.4 Plotting groups¶

The possibility to create groups gives us also the opportunity to easily create interesting plots without writing too much code. For example we can caluclate the average age of composers in each period and plot it as a bar plot:

composers.groupby('period')['age'].mean().plot(kind = 'bar')
# MZ: group by period and plot the mean of the ages

<matplotlib.axes._subplots.AxesSubplot at 0x7f6024431278>

We can also use our unstacked table of country vs. period to automatically plot all average ages split by country and period:

composer_unstacked.plot(kind = 'bar');
# average age for each country and each period

There are much more powerful ways of using grouping-like features for plotting using the ggplot type grammar of graphics where objects can be grouped within an "aeasthetic". In the example above the "colour aesthetic" would e.g. be assigned to the period variable. Such an approach removes the need to do explicit groupings as done here.

6. Advanced plotting¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

We have seen already two options to plot data: we can use the "raw" Matplotlib which in principle allows one to create any possible plot, however with lots of code, and we saw the simpler internal Pandas solution. While the latter solution is very practical to quickly look through data, it is rather cumbersome to realise more complex plots.

Here we look at another type of plotting resting on the concepts of the grammar of graphics. This approach allows to create complex plots where data can be simply split in a plot into color, shapes etc. without having to do a grouping operation in beforehand. We will mainly look at Seaborn, and finish with an example with Plotnine, the port to Python of ggplot.

Importing data¶

We come back here to the dataset of swiss towns. To make the dataset more interestig we add to it some categorical data. First we attempt to add the main language for each town. It is a good example of the type of data wranglig one ofen has to do by combining information from different sources.

#load table indicating to which canton each town belongs
cantons = pd.read_excel('Datasets/be-b-00.04-osv-01.xls',sheet_name=1)[['KTKZ','ORTNAME']]

#load general table with infos on towns
towns = pd.read_excel('Datasets/2018.xls', skiprows=list(range(5))+list(range(6,9)),
                      skipfooter=34, index_col='Commune',na_values=['*','X'])
towns = towns.reset_index()

#merge tables using the town name. This adds the canton abbreviation to the main table 
towns_canton = pd.merge(towns, cantons, left_on='Commune', right_on='ORTNAME',how = 'inner')

#load data indicating languages of each canton
language = pd.read_excel('Datasets/je-f-01.08.01.02.xlsx',skiprows=[0,2,3,4],skipfooter=11)
languages = language[['Allemand (ou suisse allemand)','Français (ou patois romand)',
         'Italien (ou dialecte tessinois/italien des grisons)']]
languages = languages.apply(pd.to_numeric, errors='coerce')
#check which language has majority in each canton
languages['language'] = np.argmax(languages.values.astype(float),axis=1)
code={0:'German', 1:'French', 2:'Italian'}
languages['Language'] = languages.language.apply(lambda x: code[x])
languages['canton'] = language['Unnamed: 0']
languages = languages[['canton','Language']]

#load table matching canton name to abbreviation
cantons_abbrev = pd.read_excel('Datasets/cantons_abbrev.xlsx')
#add full canton name to table by merging on abbreviation
canton_language = pd.merge(languages, cantons_abbrev,on='canton')

#add language by merging on canton abbreviation
towns_language = pd.merge(towns_canton, canton_language, left_on='KTKZ', right_on='abbrev')

towns_language['town_type'] = towns_language['Surface agricole en %'].apply(lambda x: 'Land' if x<50 else 'City')

#Create a new party column and a new party score column
parties = pd.melt(towns_language,id_vars=['Commune'], value_vars=['UDC','PS','PDC'], 
                  var_name= 'Party', value_name='Party score')
towns_language = pd.merge(parties, towns_language, on='Commune')

towns_language

Basic plotting¶

We finally have a table with mostly numerical information but also two categorical data: language and town type (land or city). With Seaborn we can now easily make all sorts of plots. For example what are the average scores of the different parties:

sns.barplot(data = towns_language, y='Party score', x = 'Party');

/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Do land towns vote more for the right-wing party ?

g = sns.scatterplot(data = towns_language, y='UDC', x = 'Surface agricole en %', s = 10, alpha = 0.5);
g.set_xlim([0,100]);

Using categories as "aesthetics"¶

The greate advantage of using these packages is that they allow to include categories as "aesthetics" of the plot. For example we looked before at average party scores. But are they different between language regions ? We can just specify that the hue (color) should be mapped to the town language:

sns.barplot(data = towns_language, y='Party score', x = 'Party', hue = 'Language');

/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Similarly with scatter plots. Is the relation between land and voting on the right language dependent ?

g = sns.scatterplot(data = towns_language, y='UDC', x = 'Surface agricole en %', hue = 'Language',
                    s = 10, alpha = 0.5);
g.set_xlim([0,100]);

Statistics¶

We see difference in the last plot, but it is still to clearly see the relation. Luckiliy these packages allow us to either create summary statistics or to fit the data:

g = sns.lmplot(data = towns_language, x = 'Surface agricole en %', y='UDC', hue = 'Language', scatter=True,
              scatter_kws={'alpha': 0.1});
g.ax.set_xlim([0,100]);

/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Now we can also do the same exercise for all parties. Does the relation hold?

g = sns.lmplot(data = towns_language, x = 'Surface agricole en %', y='Party score', 
               hue = 'Party', scatter=True,
              scatter_kws={'alpha': 0.1});
g.ax.set_xlim([0,100]);

/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

Adding eve more information¶

We can recover from some other place (Poste) the coordinates of each town. Again by merging we can add that information to our main table:

coords = pd.read_csv('Datasets/plz_verzeichnis_v2.csv', sep=';')[['ORTBEZ18','Geokoordinaten']]
coords['lat'] = coords.Geokoordinaten.apply(lambda x: float(x.split(', ')[0]) if type(x)==str else np.nan)
coords['long'] = coords.Geokoordinaten.apply(lambda x: float(x.split(', ')[1]) if type(x)==str else np.nan)

towns_language = pd.merge(towns_language,coords, left_on='Commune', right_on='ORTBEZ18')

So now we can in addition look at the geography of these parameters. For example, who votes for the right-wing party ?

fix, ax = plt.subplots(figsize = (12,8))
sns.scatterplot(data = towns_language, x= 'long', y = 'lat', hue='UDC', style = 'Language', palette='Reds');

# MZ: if used to ggplot -> use 'plotnine' package
# same grammar as ggplot

7. Insight into Machine Learning¶

Several more advanced applications rely on Pandas structure to work. One example is the package scikit-learn, which has become one of the dominant machine learning resource in data science. We are now going to have a very quick look at how Pandas is used in that frame.

We are going to work again with our swiss towns infos and we will see if we can predict the result of a party based on that information.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns

7.1 Data loading and features selection¶

We first load the data set:

towns = pd.read_excel('Datasets/2018.xls', skiprows=list(range(5))+list(range(6,9)),
                      skipfooter=34, index_col='Commune',na_values=['*','X'])
towns = towns.reset_index()

Now we have to get select what featurs we are going to use to predict the vote for UDC. We have to remove the results of the other parties, as those are of course correlated.

We also create a target by selecting only the UDC column

features = towns.drop('PDC', axis=1)
features = features.drop('PS', axis=1)
features = features.drop('Commune', axis=1)
features = features.drop('PVL', axis=1)
features = features.drop('PLR 2)', axis=1)
features = features.drop('PBD', axis=1)
features = features.drop('PST/Sol.', axis=1)
features = features.drop('PEV/PCS', axis=1)
features = features.drop('PES', axis=1)
features = features.drop('Petits partis de droite', axis=1)
features = features.drop('Code commune', axis=1)

features = features.dropna()
targets = features['UDC']
features = features.drop('UDC', axis=1)

features.head()

targets.head()

0    30.929249
1    33.785785
2    29.100156
3    34.937369
4    30.114599
Name: UDC, dtype: float64

7.2 Splitting the data¶

We need to be able to test whether our ML algorithm is capable of making predictions on data is has not been trained on. We therefore split our dataset into a training and a testing set. Luckily sklearn provides this out of the box if we pass the right dataframes.

from sklearn.model_selection import train_test_split
X, X_test, y, y_test = train_test_split(features, targets, 
                                        test_size = 0.2, 
                                        random_state = 42)

len(X_test)/len(X)

0.25

7.3 Choosing an ML method¶

Sklearn offers a wide range of ML methods. We are not entering into details here and choose a Random Forest regression:

from sklearn.ensemble import RandomForestRegressor

Then we instantiate the model and train it (fit):

random_forrest = RandomForestRegressor(n_estimators=1000)
random_forrest.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Finally we can use it to make predictions. In particular we can apply it to our test sample and see how it performs:

predictions = random_forrest.predict(X_test)
mae = np.mean(abs(predictions - y_test))
print(mae)

7.092857784914048

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

sns.scatterplot(x = y_test, y = predictions);

import scipy.stats

scipy.stats.pearsonr(y_test,predictions)

(0.6900691335750526, 1.8302171574366992e-43)

7.4 Important features¶

A random forest classificer has the advantage that it can provide us information about how important each feature is. In other terms which features help the most in predicting:

print(random_forrest.feature_importances_)

[0.01642088 0.03087361 0.02929625 0.07733401 0.05875088 0.06434162
 0.02267082 0.03147107 0.02621289 0.02242997 0.02138494 0.01213671
 0.03465807 0.0244234  0.03648381 0.02957084 0.13884786 0.02812714
 0.02704483 0.02524017 0.01225019 0.02748501 0.01709968 0.01442152
 0.00932912 0.0553376  0.00979078 0.01229114 0.03309209 0.01542136
 0.03576174]

The larger the number, the better its predictions power. We can sort this list and see to what features they correspond in our feature Dataframe:

features.keys()[np.argsort(random_forrest.feature_importances_)]

Index(['Etablissements total', 'Secteur secondaire.1', 'Ménages privés',
       'Emplois total', 'Secteur tertiaire.1', 'Secteur tertiaire',
       'Nouveaux logements construits pour 1000 habitants', 'Habitants',
       'Secteur secondaire', 'Taux brut de mortalité', 'Taux brut de natalité',
       '65 ans ou plus', 'Surface totale en km²', 'Surface improductive en %',
       'Taux brut de divortialité', 'Surface boisée en %', 'Secteur primaire',
       'Variation en ha.1', 'Densité de la population par km²',
       'Variation en ha', 'Variation en %', 'Taux brut de nuptialité',
       'Taux de logements vacants', 'Taille moyenne des ménages en personnes',
       'Taux d'aide sociale', 'Surfaces d'habitat et d'infrastructure en %',
       'Secteur primaire.1', '0-19 ans', '20-64 ans', 'Etrangers en %',
       'Surface agricole en %'],
      dtype='object')

Finally, we can have a look at the actual correlations that seem to be indicated here. For this we select a few features and create a long format table for plotting:

towns_melt = pd.melt(towns, id_vars=['Commune','UDC'],
        value_vars=['Etrangers en %','Surface agricole en %','Taux brut de mortalité'])

sns.lmplot(data = towns_melt, x = 'value', y = 'UDC', hue = 'variable', scatter_kws={'alpha' : 0.1});

/usr/local/lib/python3.5/dist-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

There are indeed strong correlations where they are expected! Notice alos that we can learn things here: the right-wing party is most successful where there's the least foreigners...

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Exercise¶

For these exercices we are using a dataset provided by Airbnb for a Kaggle competition. It describes its offer for New York City in 2019, including types of appartments, price, location etc.

1. Create a dataframe¶

Create a dataframe of a few lines with objects and their poperties (e.g fruits, their weight and colour). Calculate the mean of your Dataframe.

dict_of_list = {'fruit_name': ["apple", "pear", "watermelon"], 'weight':[100, 94, 95], 'colour':['green', "yellow", "rosa"]}
fruits = pd.DataFrame(dict_of_list)

fruits.describe()
# calculates common statistical values
# and makes it only for the columns that make sense

fruits.mean()

weight    96.333333
dtype: float64

2. Import¶

Import the table called AB_NYC_2019.csv as a dataframe. It is located in the Datasets folder. Have a look at the beginning of the table (head).
Create a histogram of prices

mydata = pd.read_csv('Datasets/AB_NYC_2019.csv')
# mydata

plt.style.use('ggplot')
mydata['price'].plot.hist(alpha = 0.5)
plt.show()
# to have nicer plot (more bars)
mydata['price'].plot.hist(alpha = 0.5, bins=range(0,1000,10))
plt.show()

3. Operations¶

Create a new column in the dataframe by multiplying the "price" and "availability_365" columns to get an estimate of the maximum yearly income.

mydata['max_yearly_income'] = mydata['price'] * mydata['availability_365']

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

# what can be done with numpy can be done
# np.log(mydata['price'])

# mydata

3b. Subselection and plotting¶

Create a new Dataframe by first subselecting yearly incomes between 1 and 100'000. Then make a scatter plot of yearly income versus number of reviews

#mydata_sub = mydata[ (mydata['max_yearly_income'] >= 1) and (mydata['max_yearly_income'] <= 100000) ]
#mydata_sub = mydata[ (mydata.max_yearly_income >= 1) and (mydata.max_yearly_income <= 100000) ]
mydata_sub = mydata[ (mydata['max_yearly_income'] >= 1) & (mydata['max_yearly_income'] <= 100000) ].copy()
# mydata[(mydata.max_yearly_income>=1)&(mydata.max_yearly_income <= 100000)].copy()
# mydata_sub

mydata_sub.plot(x = 'number_of_reviews', y = 'max_yearly_income',kind = 'scatter')
max(mydata_sub['max_yearly_income'])

99900

4. Combine¶

We provide below an additional table that contains the number of inhabitants of each of New York's boroughs ("neighbourhood_group" in the table). Use merge to add this population information to each element in the original dataframe.

borough_dt = pd.read_excel('Datasets/ny_boroughs.xlsx')
#borough_dt

#mydata

merged_dt = pd.merge(mydata, borough_dt, left_on='neighbourhood_group', right_on='borough', how='left')
#merged_dt

5. Groups¶

Using groupby calculate the average price for each type of room (room_type) in each neighbourhood_group. What is the average price for an entire home in Brooklyn ?
Unstack the multi-level Dataframe into a regular Dataframe with unstack() and create a bar plot with the resulting table

merged_dt.groupby(['neighbourhood_group','room_type']).price.mean()

neighbourhood_group  room_type      
Bronx                Entire home/apt    127.506596
                     Private room        66.788344
                     Shared room         59.800000
Brooklyn             Entire home/apt    178.327545
                     Private room        76.500099
                     Shared room         50.527845
Manhattan            Entire home/apt    249.239109
                     Private room       116.776622
                     Shared room         88.977083
Queens               Entire home/apt    147.050573
                     Private room        71.762456
                     Shared room         69.020202
Staten Island        Entire home/apt    173.846591
                     Private room        62.292553
                     Shared room         57.444444
Name: price, dtype: float64

merged_dt.groupby(['neighbourhood_group','room_type']).price.mean()['Brooklyn']['Entire home/apt']

178.32754472225128

merged_dt.groupby(['neighbourhood_group','room_type'])['price'].mean()['Brooklyn']['Entire home/apt']

178.32754472225128

unstd_dt = merged_dt.groupby(['neighbourhood_group','room_type']).price.mean().unstack()
unstd_dt.plot(kind = 'bar');

6. Advanced plotting¶

fig, ax = plt.subplots(figsize=(10,8))
g = sns.scatterplot(data = merged_dt, y = 'latitude', x = 'longitude', hue = 'price',
                    hue_norm=(0,200), s=10, palette='inferno')

Using Seaborn, create a scatter plot where x and y positions are longitude and lattitude, the color reflects price and the shape of the marker the borough (neighbourhood_group). Can you recognize parts of new york ? Does the map make sense ?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Exercise¶

For these exercices we are using a dataset provided by Airbnb for a Kaggle competition. It describes its offer for New York City in 2019, including types of appartments, price, location etc.

1. Create a dataframe¶

Create a dataframe of a few lines with objects and their poperties (e.g fruits, their weight and colour). Calculate the mean of your Dataframe.

fruits = pd.DataFrame({'fruits':['strawberry', 'orange','melon'], 'weight':[20, 200, 1000], 'color': ['red','orange','yellow']})

fruits

fruits.mean()

weight    406.666667
dtype: float64

2. Import¶

Import the table called AB_NYC_2019.csv as a dataframe. It is located in the Datasets folder. Have a look at the beginning of the table (head).
Create a histogram of prices

airbnb = pd.read_csv('Datasets/AB_NYC_2019.csv')

# airbnb.head()

airbnb.price.plot(kind = 'hist', bins = range(0,1000,10))

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d11f5ef28>

3. Operations¶

Create a new column in the dataframe by multiplying the "price" and "availability_365" columns to get an estimate of the maximum yearly income.

airbnb['yearly_income'] = airbnb['price']*airbnb['availability_365']

/usr/local/lib/python3.5/dist-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1

  ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

# airbnb['yearly_income']

3b. Subselection and plotting¶

Create a new Dataframe by first subselecting yearly incomes between 1 and 100'000 and then by suppressing cases with 0 reviews. Then make a scatter plot of yearly income versus number of reviews

sub_airbnb = airbnb[(airbnb.yearly_income>1)&(airbnb.yearly_income<100000)].copy()

sub_airbnb.plot(x = 'number_of_reviews', y = 'price', kind = 'scatter', alpha = 0.01)
plt.show()

4. Combine¶

We provide below and additional table that contains the number of inhabitants of each of New York's boroughs ("neighbourhood_group" in the table). Use merge to add this population information to each element in the original dataframe.

boroughs = pd.read_excel('Datasets/ny_boroughs.xlsx')

boroughs

merged = pd.merge(airbnb, boroughs, left_on = 'neighbourhood_group', right_on='borough')

merged.head()

5. Groups¶

Using groupby calculate the average price for each type of room (room_type) in each neighbourhood_group. What is the average price for an entire home in Brooklyn ?
Unstack the multi-level Dataframe into a regular Dataframe with unstack() and create a bar plot with the resulting table

summary = airbnb.groupby(['neighbourhood_group','room_type']).mean().price

summary[('Brooklyn','Entire home/apt')]

178.32754472225128

summary.unstack().plot(kind = 'bar', alpha = 0.5)
plt.show()

6. Advanced plotting¶

Using Seaborn, create a scatter plot where x and y positions are longitude and lattitude, the color reflects price and the shape of the marker the borough (neighbourhood_group). Can you recognize parts of new york ? Does the map make sense ?

fig, ax = plt.subplots(figsize=(10,8))
g = sns.scatterplot(data = airbnb, y = 'latitude', x = 'longitude', hue = 'price',
                    hue_norm=(0,200), s=10, palette='inferno')

	Code commune	Habitants	Densité de la population par km²	Variation en %	Par mouvement migratoire	Par excédent naturel	Etrangers en %	0-19 ans	20-64 ans	>64 ans	...	PDC	PSS	UDC	PBD	PEV/PCS	PVL	PST/Sol.	PES	Petits partis de droite	Autres
Commune
Aeugst am Albis	1	1910	241	4.7	4.4	0.1	13.6	21.6	63.2	15.1	...	3.3	15.2	28.7	6.8	4.2	10.7	0.0	9.7	4.9	2.5
Affoltern am Albis	2	11160	1052	0.6	0.0	0.4	26.0	21.4	63.0	15.5	...	5.7	16.5	31.9	6.6	4.5	10.3	0.1	8.4	2.3	2.9
Bonstetten	3	5173	696	0.3	-1.0	0.9	12.7	23.9	63.5	12.6	...	3.7	18.2	27.1	5.7	3.3	17.0	0.1	7.5	1.6	3.1
Hausen am Albis	4	3356	246	-0.1	-0.4	0.3	12.4	23.8	61.3	14.9	...	3.3	15.6	33.5	6.7	2.2	14.2	0.0	11.4	1.7	1.7
Hedingen	5	3469	531	1.7	0.9	0.6	14.0	24.0	63.0	13.0	...	3.7	20.5	31.4	4.0	4.1	14.3	0.2	9.1	2.2	2.3

	birth	death	city
composer
Verdi	1813	1901	Roncole
Dvorak	1841	1904	Nelahozeves
Schumann	1810	1856	Zwickau
Stravinsky	1882	1971	Oranienbaum
Mahler	1860	1911	Kaliste

		birth								death
		count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
period	country
baroque	England	1.0	1659.000000	NaN	1659.0	1659.00	1659.0	1659.00	1659.0	1.0	1695.000000	NaN	1695.0	1695.00	1695.0	1695.00	1695.0
	France	3.0	1650.666667	29.263174	1626.0	1634.50	1643.0	1663.00	1683.0	3.0	1709.666667	51.733290	1661.0	1682.50	1704.0	1734.00	1764.0
	Germany	2.0	1685.000000	0.000000	1685.0	1685.00	1685.0	1685.00	1685.0	2.0	1754.500000	6.363961	1750.0	1752.25	1754.5	1756.75	1759.0
	Italy	4.0	1663.000000	53.285395	1587.0	1649.25	1677.5	1691.25	1710.0	4.0	1717.250000	52.442826	1640.0	1712.00	1736.0	1741.25	1757.0
classic	Austria	2.0	1744.000000	16.970563	1732.0	1738.00	1744.0	1750.00	1756.0	2.0	1800.000000	12.727922	1791.0	1795.50	1800.0	1804.50	1809.0
	Czechia	1.0	1731.000000	NaN	1731.0	1731.00	1731.0	1731.00	1731.0	1.0	1799.000000	NaN	1799.0	1799.00	1799.0	1799.00	1799.0
	Italy	1.0	1749.000000	NaN	1749.0	1749.00	1749.0	1749.00	1749.0	1.0	1801.000000	NaN	1801.0	1801.00	1801.0	1801.00	1801.0
	Spain	1.0	1754.000000	NaN	1754.0	1754.00	1754.0	1754.00	1754.0	1.0	1806.000000	NaN	1806.0	1806.00	1806.0	1806.00	1806.0
modern	Austria	1.0	1885.000000	NaN	1885.0	1885.00	1885.0	1885.00	1885.0	1.0	1935.000000	NaN	1935.0	1935.00	1935.0	1935.00	1935.0
	Czechia	1.0	1854.000000	NaN	1854.0	1854.00	1854.0	1854.00	1854.0	1.0	1928.000000	NaN	1928.0	1928.00	1928.0	1928.00	1928.0
	England	2.0	1936.500000	48.790368	1902.0	1919.25	1936.5	1953.75	1971.0	1.0	1983.000000	NaN	1983.0	1983.00	1983.0	1983.00	1983.0
	France	2.0	1916.500000	12.020815	1908.0	1912.25	1916.5	1920.75	1925.0	2.0	2004.000000	16.970563	1992.0	1998.00	2004.0	2010.00	2016.0
	Germany	1.0	1895.000000	NaN	1895.0	1895.00	1895.0	1895.00	1895.0	1.0	1982.000000	NaN	1982.0	1982.00	1982.0	1982.00	1982.0
	RUssia	1.0	1891.000000	NaN	1891.0	1891.00	1891.0	1891.00	1891.0	1.0	1953.000000	NaN	1953.0	1953.00	1953.0	1953.00	1953.0
	Russia	2.0	1894.000000	16.970563	1882.0	1888.00	1894.0	1900.00	1906.0	2.0	1973.000000	2.828427	1971.0	1972.00	1973.0	1974.00	1975.0
	USA	3.0	1918.333333	18.502252	1900.0	1909.00	1918.0	1927.50	1937.0	2.0	1990.000000	0.000000	1990.0	1990.00	1990.0	1990.00	1990.0
post-romantic	Austria	2.0	1842.000000	25.455844	1824.0	1833.00	1842.0	1851.00	1860.0	2.0	1903.500000	10.606602	1896.0	1899.75	1903.5	1907.25	1911.0
	Finland	1.0	1865.000000	NaN	1865.0	1865.00	1865.0	1865.00	1865.0	1.0	1957.000000	NaN	1957.0	1957.00	1957.0	1957.00	1957.0
	Germany	1.0	1864.000000	NaN	1864.0	1864.00	1864.0	1864.00	1864.0	1.0	1949.000000	NaN	1949.0	1949.00	1949.0	1949.00	1949.0
	Italy	1.0	1858.000000	NaN	1858.0	1858.00	1858.0	1858.00	1858.0	1.0	1924.000000	NaN	1924.0	1924.00	1924.0	1924.00	1924.0
renaissance	Belgium	2.0	1464.500000	95.459415	1397.0	1430.75	1464.5	1498.25	1532.0	2.0	1534.000000	84.852814	1474.0	1504.00	1534.0	1564.00	1594.0
	England	2.0	1551.500000	16.263456	1540.0	1545.75	1551.5	1557.25	1563.0	2.0	1624.500000	2.121320	1623.0	1623.75	1624.5	1625.25	1626.0
	Italy	3.0	1552.666667	23.965253	1525.0	1545.50	1566.0	1566.50	1567.0	3.0	1616.666667	24.704925	1594.0	1603.50	1613.0	1628.00	1643.0
romantic	Czechia	2.0	1832.500000	12.020815	1824.0	1828.25	1832.5	1836.75	1841.0	2.0	1894.000000	14.142136	1884.0	1889.00	1894.0	1899.00	1904.0
	France	3.0	1821.000000	19.672316	1803.0	1810.50	1818.0	1830.00	1842.0	3.0	1891.333333	21.548395	1869.0	1881.00	1893.0	1902.50	1912.0
	Germany	4.0	1806.500000	26.388129	1770.0	1800.00	1811.5	1818.00	1833.0	4.0	1865.750000	30.934070	1827.0	1848.75	1869.5	1886.50	1897.0
	Italy	4.0	1817.250000	28.004464	1797.0	1800.00	1807.0	1824.25	1858.0	4.0	1875.750000	40.574006	1835.0	1844.75	1874.5	1905.50	1919.0
	Russia	2.0	1836.000000	4.242641	1833.0	1834.50	1836.0	1837.50	1839.0	2.0	1884.000000	4.242641	1881.0	1882.50	1884.0	1885.50	1887.0
	Spain	2.0	1863.500000	4.949747	1860.0	1861.75	1863.5	1865.25	1867.0	2.0	1912.500000	4.949747	1909.0	1910.75	1912.5	1914.25	1916.0

		birth	death
period	country
baroque	England	1659.000000	1695.000000
	France	1650.666667	1709.666667
	Germany	1685.000000	1754.500000
	Italy	1663.000000	1717.250000
classic	Austria	1744.000000	1800.000000
	Czechia	1731.000000	1799.000000
	Italy	1749.000000	1801.000000
	Spain	1754.000000	1806.000000
modern	Austria	1885.000000	1935.000000
	Czechia	1854.000000	1928.000000
	England	1936.500000	1983.000000
	France	1916.500000	2004.000000
	Germany	1895.000000	1982.000000
	RUssia	1891.000000	1953.000000
	Russia	1894.000000	1973.000000
	USA	1918.333333	1990.000000
post-romantic	Austria	1842.000000	1903.500000
	Finland	1865.000000	1957.000000
	Germany	1864.000000	1949.000000
	Italy	1858.000000	1924.000000
renaissance	Belgium	1464.500000	1534.000000
	England	1551.500000	1624.500000
	Italy	1552.666667	1616.666667
romantic	Czechia	1832.500000	1894.000000
	France	1821.000000	1891.333333
	Germany	1806.500000	1865.750000
	Italy	1817.250000	1875.750000
	Russia	1836.000000	1884.000000
	Spain	1863.500000	1912.500000

		composer	birth	death	period	country
period	country
baroque	England	PurcellPurcellPurcell	4977	5085.0	baroquebaroquebaroque	EnglandEnglandEngland
	France	RameauRameauRameau	5049	5292.0	baroquebaroquebaroque	FranceFranceFrance
	Germany	HaendelHaendelHaendel	5055	5277.0	baroquebaroquebaroque	GermanyGermanyGermany
	Italy	ScarlattiScarlattiScarlatti	5130	5271.0	baroquebaroquebaroque	ItalyItalyItaly
classic	Austria	MozartMozartMozart	5268	5427.0	classicclassicclassic	AustriaAustriaAustria
	Czechia	DusekDusekDusek	5193	5397.0	classicclassicclassic	CzechiaCzechiaCzechia
	Italy	CimarosaCimarosaCimarosa	5247	5403.0	classicclassicclassic	ItalyItalyItaly
	Spain	SolerSolerSoler	5262	5418.0	classicclassicclassic	SpainSpainSpain
modern	Austria	BergBergBerg	5655	5805.0	modernmodernmodern	AustriaAustriaAustria
	Czechia	JanacekJanacekJanacek	5562	5784.0	modernmodernmodern	CzechiaCzechiaCzechia
	England	WaltonWaltonWalton	5913	5949.0	modernmodernmodern	EnglandEnglandEngland
	France	MessiaenMessiaenMessiaen	5775	6048.0	modernmodernmodern	FranceFranceFrance
	Germany	OrffOrffOrff	5685	5946.0	modernmodernmodern	GermanyGermanyGermany
	RUssia	ProkofievProkofievProkofiev	5673	5859.0	modernmodernmodern	RUssiaRUssiaRUssia
	Russia	StravinskyStravinskyStravinsky	5718	5925.0	modernmodernmodern	RussiaRussiaRussia
	USA	GlassGlassGlass	5811	5970.0	modernmodernmodern	USAUSAUSA
post-romantic	Austria	MahlerMahlerMahler	5580	5733.0	post-romanticpost-romanticpost-romantic	AustriaAustriaAustria
	Finland	SibeliusSibeliusSibelius	5595	5871.0	post-romanticpost-romanticpost-romantic	FinlandFinlandFinland
	Germany	StraussStraussStrauss	5592	5847.0	post-romanticpost-romanticpost-romantic	GermanyGermanyGermany
	Italy	PucciniPucciniPuccini	5574	5772.0	post-romanticpost-romanticpost-romantic	ItalyItalyItaly
renaissance	Belgium	LassusLassusLassus	4596	4782.0	renaissancerenaissancerenaissance	BelgiumBelgiumBelgium
	England	DowlandDowlandDowland	4689	4878.0	renaissancerenaissancerenaissance	EnglandEnglandEngland
	Italy	PalestrinaPalestrinaPalestrina	4701	4929.0	renaissancerenaissancerenaissance	ItalyItalyItaly
romantic	Czechia	SmetanaSmetanaSmetana	5523	5712.0	romanticromanticromantic	CzechiaCzechiaCzechia
	France	MassenetMassenetMassenet	5526	5736.0	romanticromanticromantic	FranceFranceFrance
	Germany	WagnerWagnerWagner	5499	5691.0	romanticromanticromantic	GermanyGermanyGermany
	Italy	VerdiVerdiVerdi	5574	5757.0	romanticromanticromantic	ItalyItalyItaly
	Russia	MussorsgskyMussorsgskyMussorsgsky	5517	5661.0	romanticromanticromantic	RussiaRussiaRussia
	Spain	GranadosGranadosGranados	5601	5748.0	romanticromanticromantic	SpainSpainSpain

	Commune	Surface agricole en %	year	Party	Vote fraction
0	Aeugst am Albis	51.334180	2014	UDC	28.652058
1	Affoltern am Albis	40.094340	2014	UDC	31.894371
2	Bonstetten	55.436242	2014	UDC	27.095457
3	Hausen am Albis	55.774854	2014	UDC	33.535200
4	Hedingen	46.248086	2014	UDC	31.438683

	birth	city	death
Beethoven	1770	Bonn	1827
Mahler	1860	Kaliste	1911
Puccini	1858	Lucques	1924
Shostakovich	1906	Saint-Petersburg	1975

	first name
composer
Mahler	Gustav
Beethoven	Ludwig van
Puccini	Giacomo
Brahms	Johannes

	composer	birth	death	period	country
14	Haendel	1685	1759.0	baroque	Germany
16	Purcell	1659	1695.0	baroque	England
17	Charpentier	1643	1704.0	baroque	France
20	Couperin	1626	1661.0	baroque	France
21	Rameau	1683	1764.0	baroque	France
28	Caldara	1670	1736.0	baroque	Italy
29	Pergolesi	1710	1736.0	baroque	Italy
30	Scarlatti	1685	1757.0	baroque	Italy
31	Caccini	1587	1640.0	baroque	Italy
47	Bach	1685	1750.0	baroque	Germany

period	baroque	classic	modern	post-romantic	renaissance	romantic
country
Austria	NaN	56.0	50.0	61.5	NaN	NaN
Belgium	NaN	NaN	NaN	NaN	69.5	NaN
Czechia	NaN	68.0	74.0	NaN	NaN	61.500000
England	36.00	NaN	81.0	NaN	73.0	NaN
Finland	NaN	NaN	NaN	92.0	NaN	NaN
France	59.00	NaN	87.5	NaN	NaN	70.333333
Germany	69.50	NaN	87.0	85.0	NaN	59.250000
Italy	54.25	52.0	NaN	66.0	64.0	58.500000
RUssia	NaN	NaN	62.0	NaN	NaN	NaN
Russia	NaN	NaN	79.0	NaN	NaN	48.000000
Spain	NaN	52.0	NaN	NaN	NaN	49.000000
USA	NaN	NaN	81.0	NaN	NaN	NaN

	Commune	Party	Party score	Code commune	Habitants	Variation en %	Densité de la population par km²	Etrangers en %	0-19 ans	20-64 ans	...	PBD	PST/Sol.	PES	Petits partis de droite	KTKZ	ORTNAME	canton	Language	abbrev	town_type
0	Aeugst am Albis	UDC	30.929249	1	1977	8.388158	249.936789	13.100658	20.586748	62.822458	...	2.617442	0.167638	7.075094	4.888178	ZH	Aeugst am Albis	Zurich	German	ZH	City
1	Aeugst am Albis	PS	18.645940	1	1977	8.388158	249.936789	13.100658	20.586748	62.822458	...	2.617442	0.167638	7.075094	4.888178	ZH	Aeugst am Albis	Zurich	German	ZH	City
2	Aeugst am Albis	PDC	2.076428	1	1977	8.388158	249.936789	13.100658	20.586748	62.822458	...	2.617442	0.167638	7.075094	4.888178	ZH	Aeugst am Albis	Zurich	German	ZH	City
3	Affoltern am Albis	UDC	33.785785	2	11900	7.294203	1123.701605	27.848740	20.285714	62.201681	...	4.164299	0.190049	6.211047	1.768197	ZH	Affoltern am Albis	Zurich	German	ZH	Land
4	Affoltern am Albis	PS	19.080314	2	11900	7.294203	1123.701605	27.848740	20.285714	62.201681	...	4.164299	0.190049	6.211047	1.768197	ZH	Affoltern am Albis	Zurich	German	ZH	Land
5	Affoltern am Albis	PDC	4.585387	2	11900	7.294203	1123.701605	27.848740	20.285714	62.201681	...	4.164299	0.190049	6.211047	1.768197	ZH	Affoltern am Albis	Zurich	German	ZH	Land
6	Bonstetten	UDC	29.100156	3	5435	5.349874	731.493943	14.149034	23.808648	60.717571	...	3.803108	0.112518	6.661066	1.915807	ZH	Bonstetten	Zurich	German	ZH	City
7	Bonstetten	PS	20.403265	3	5435	5.349874	731.493943	14.149034	23.808648	60.717571	...	3.803108	0.112518	6.661066	1.915807	ZH	Bonstetten	Zurich	German	ZH	City
8	Bonstetten	PDC	3.378541	3	5435	5.349874	731.493943	14.149034	23.808648	60.717571	...	3.803108	0.112518	6.661066	1.915807	ZH	Bonstetten	Zurich	German	ZH	City
9	Hausen am Albis	UDC	34.937369	4	3571	6.279762	262.573529	14.533744	22.738729	60.403248	...	4.656087	0.193911	8.021665	1.825436	ZH	Hausen am Albis	Zurich	German	ZH	City
10	Hausen am Albis	PS	19.393305	4	3571	6.279762	262.573529	14.533744	22.738729	60.403248	...	4.656087	0.193911	8.021665	1.825436	ZH	Hausen am Albis	Zurich	German	ZH	City
11	Hausen am Albis	PDC	2.881915	4	3571	6.279762	262.573529	14.533744	22.738729	60.403248	...	4.656087	0.193911	8.021665	1.825436	ZH	Hausen am Albis	Zurich	German	ZH	City
12	Hedingen	UDC	30.114599	5	3687	8.123167	564.624809	14.971522	22.484405	62.110117	...	3.768864	0.227988	6.466387	1.840045	ZH	Hedingen	Zurich	German	ZH	Land
13	Hedingen	PS	22.478008	5	3687	8.123167	564.624809	14.971522	22.484405	62.110117	...	3.768864	0.227988	6.466387	1.840045	ZH	Hedingen	Zurich	German	ZH	Land
14	Hedingen	PDC	3.918166	5	3687	8.123167	564.624809	14.971522	22.484405	62.110117	...	3.768864	0.227988	6.466387	1.840045	ZH	Hedingen	Zurich	German	ZH	Land
15	Kappel am Albis	UDC	48.615099	6	1110	20.915033	140.151515	18.018018	26.486486	60.180180	...	5.134268	0.312447	4.382706	2.769802	ZH	Kappel am Albis	Zurich	German	ZH	City
16	Kappel am Albis	PS	10.285425	6	1110	20.915033	140.151515	18.018018	26.486486	60.180180	...	5.134268	0.312447	4.382706	2.769802	ZH	Kappel am Albis	Zurich	German	ZH	City
17	Kappel am Albis	PDC	2.744469	6	1110	20.915033	140.151515	18.018018	26.486486	60.180180	...	5.134268	0.312447	4.382706	2.769802	ZH	Kappel am Albis	Zurich	German	ZH	City
18	Knonau	UDC	32.876136	7	2168	20.444444	335.085008	17.158672	24.077491	60.885609	...	5.944968	0.008415	5.801919	3.437395	ZH	Knonau	Zurich	German	ZH	City
19	Knonau	PS	18.436553	7	2168	20.444444	335.085008	17.158672	24.077491	60.885609	...	5.944968	0.008415	5.801919	3.437395	ZH	Knonau	Zurich	German	ZH	City
20	Knonau	PDC	3.126052	7	2168	20.444444	335.085008	17.158672	24.077491	60.885609	...	5.944968	0.008415	5.801919	3.437395	ZH	Knonau	Zurich	German	ZH	City
21	Maschwanden	UDC	43.383446	8	626	1.623377	133.475480	12.140575	23.162939	59.744409	...	3.452833	0.644309	5.170990	2.114654	ZH	Maschwanden	Zurich	German	ZH	City
22	Maschwanden	PS	22.732529	8	626	1.623377	133.475480	12.140575	23.162939	59.744409	...	3.452833	0.644309	5.170990	2.114654	ZH	Maschwanden	Zurich	German	ZH	City
23	Maschwanden	PDC	3.502396	8	626	1.623377	133.475480	12.140575	23.162939	59.744409	...	3.452833	0.644309	5.170990	2.114654	ZH	Maschwanden	Zurich	German	ZH	City
24	Mettmenstetten	UDC	35.671015	9	4861	14.565166	373.062164	14.873483	22.341082	60.851677	...	4.352704	0.133017	5.059457	3.569025	ZH	Mettmenstetten	Zurich	German	ZH	City
25	Mettmenstetten	PS	18.800282	9	4861	14.565166	373.062164	14.873483	22.341082	60.851677	...	4.352704	0.133017	5.059457	3.569025	ZH	Mettmenstetten	Zurich	German	ZH	City
26	Mettmenstetten	PDC	3.649155	9	4861	14.565166	373.062164	14.873483	22.341082	60.851677	...	4.352704	0.133017	5.059457	3.569025	ZH	Mettmenstetten	Zurich	German	ZH	City
27	Obfelden	UDC	36.174029	10	5131	9.496372	680.503979	20.015591	23.055935	60.202690	...	4.358116	0.086903	3.705416	2.982453	ZH	Obfelden	Zurich	German	ZH	Land
28	Obfelden	PS	16.922138	10	5131	9.496372	680.503979	20.015591	23.055935	60.202690	...	4.358116	0.086903	3.705416	2.982453	ZH	Obfelden	Zurich	German	ZH	Land
29	Obfelden	PDC	6.488176	10	5131	9.496372	680.503979	20.015591	23.055935	60.202690	...	4.358116	0.086903	3.705416	2.982453	ZH	Obfelden	Zurich	German	ZH	Land
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
6318	Muriaux	UDC	14.609053	6753	504	3.703704	29.857820	7.341270	19.841270	60.515873	...	NaN	4.320988	8.641975	NaN	JU	Muriaux	Jura	French	JU	City
6319	Muriaux	PS	8.641975	6753	504	3.703704	29.857820	7.341270	19.841270	60.515873	...	NaN	4.320988	8.641975	NaN	JU	Muriaux	Jura	French	JU	City
6320	Muriaux	PDC	20.370370	6753	504	3.703704	29.857820	7.341270	19.841270	60.515873	...	NaN	4.320988	8.641975	NaN	JU	Muriaux	Jura	French	JU	City
6321	Le Noirmont	UDC	8.346334	6754	1845	10.877404	90.485532	15.880759	21.680217	61.517615	...	NaN	2.964119	7.332293	NaN	JU	Le Noirmont	Jura	French	JU	City
6322	Le Noirmont	PS	25.663027	6754	1845	10.877404	90.485532	15.880759	21.680217	61.517615	...	NaN	2.964119	7.332293	NaN	JU	Le Noirmont	Jura	French	JU	City
6323	Le Noirmont	PDC	15.834633	6754	1845	10.877404	90.485532	15.880759	21.680217	61.517615	...	NaN	2.964119	7.332293	NaN	JU	Le Noirmont	Jura	French	JU	City
6324	Saignelégier	UDC	9.322820	6757	2556	2.240000	80.707294	8.020344	21.244131	60.015649	...	NaN	5.287570	9.137291	NaN	JU	Saignelégier	Jura	French	JU	Land
6325	Saignelégier	PS	24.768089	6757	2556	2.240000	80.707294	8.020344	21.244131	60.015649	...	NaN	5.287570	9.137291	NaN	JU	Saignelégier	Jura	French	JU	Land
6326	Saignelégier	PDC	25.278293	6757	2556	2.240000	80.707294	8.020344	21.244131	60.015649	...	NaN	5.287570	9.137291	NaN	JU	Saignelégier	Jura	French	JU	Land
6327	Soubey	UDC	21.739130	6759	134	-8.843537	9.933284	2.238806	14.925373	56.716418	...	NaN	0.000000	8.695652	NaN	JU	Soubey	Jura	French	JU	Land
6328	Soubey	PS	15.217391	6759	134	-8.843537	9.933284	2.238806	14.925373	56.716418	...	NaN	0.000000	8.695652	NaN	JU	Soubey	Jura	French	JU	Land
6329	Soubey	PDC	42.028986	6759	134	-8.843537	9.933284	2.238806	14.925373	56.716418	...	NaN	0.000000	8.695652	NaN	JU	Soubey	Jura	French	JU	Land
6330	Alle	UDC	8.108108	6771	1817	7.769870	171.415094	10.621904	24.766098	54.980737	...	NaN	1.719902	4.176904	NaN	JU	Alle	Jura	French	JU	City
6331	Alle	PS	14.557740	6771	1817	7.769870	171.415094	10.621904	24.766098	54.980737	...	NaN	1.719902	4.176904	NaN	JU	Alle	Jura	French	JU	City
6332	Alle	PDC	48.341523	6771	1817	7.769870	171.415094	10.621904	24.766098	54.980737	...	NaN	1.719902	4.176904	NaN	JU	Alle	Jura	French	JU	City
6333	Beurnevésin	UDC	16.279070	6773	127	-8.633094	24.950884	6.299213	12.598425	48.818898	...	NaN	3.100775	6.201550	NaN	JU	Beurnevésin	Jura	French	JU	City
6334	Beurnevésin	PS	6.976744	6773	127	-8.633094	24.950884	6.299213	12.598425	48.818898	...	NaN	3.100775	6.201550	NaN	JU	Beurnevésin	Jura	French	JU	City
6335	Beurnevésin	PDC	48.062016	6773	127	-8.633094	24.950884	6.299213	12.598425	48.818898	...	NaN	3.100775	6.201550	NaN	JU	Beurnevésin	Jura	French	JU	City
6336	Boncourt	UDC	8.318099	6774	1205	-7.307692	133.740289	10.539419	17.925311	52.614108	...	NaN	0.548446	2.285192	NaN	JU	Boncourt	Jura	French	JU	Land
6337	Boncourt	PS	22.760512	6774	1205	-7.307692	133.740289	10.539419	17.925311	52.614108	...	NaN	0.548446	2.285192	NaN	JU	Boncourt	Jura	French	JU	Land
6338	Boncourt	PDC	44.972578	6774	1205	-7.307692	133.740289	10.539419	17.925311	52.614108	...	NaN	0.548446	2.285192	NaN	JU	Boncourt	Jura	French	JU	Land
6339	Bonfol	UDC	21.452703	6775	665	-2.349486	48.969072	9.172932	15.789474	53.984962	...	NaN	2.364865	6.418919	NaN	JU	Bonfol	Jura	French	JU	Land
6340	Bonfol	PS	15.371622	6775	665	-2.349486	48.969072	9.172932	15.789474	53.984962	...	NaN	2.364865	6.418919	NaN	JU	Bonfol	Jura	French	JU	Land
6341	Bonfol	PDC	31.418919	6775	665	-2.349486	48.969072	9.172932	15.789474	53.984962	...	NaN	2.364865	6.418919	NaN	JU	Bonfol	Jura	French	JU	Land
6342	Bure	UDC	5.503356	6778	685	3.162651	50.036523	6.131387	19.854015	54.306569	...	NaN	0.671141	1.208054	NaN	JU	Bure	Jura	French	JU	Land
6343	Bure	PS	8.859060	6778	685	3.162651	50.036523	6.131387	19.854015	54.306569	...	NaN	0.671141	1.208054	NaN	JU	Bure	Jura	French	JU	Land
6344	Bure	PDC	49.395973	6778	685	3.162651	50.036523	6.131387	19.854015	54.306569	...	NaN	0.671141	1.208054	NaN	JU	Bure	Jura	French	JU	Land
6345	Coeuve	UDC	8.194444	6781	732	8.124077	62.994836	3.551913	25.136612	56.147541	...	NaN	2.222222	5.000000	NaN	JU	Coeuve	Jura	French	JU	City
6346	Coeuve	PS	24.722222	6781	732	8.124077	62.994836	3.551913	25.136612	56.147541	...	NaN	2.222222	5.000000	NaN	JU	Coeuve	Jura	French	JU	City
6347	Coeuve	PDC	36.527778	6781	732	8.124077	62.994836	3.551913	25.136612	56.147541	...	NaN	2.222222	5.000000	NaN	JU	Coeuve	Jura	French	JU	City

	weight
count	3.000000
mean	96.333333
std	3.214550
min	94.000000
25%	94.500000
50%	95.000000
75%	97.500000
max	100.000000

	borough	population
0	Brooklyn	2648771
1	Manhattan	1664727
2	Queens	2358582
3	Staten Island	479458
4	Bronx	1471160

	id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	yearly_income	borough	population
0	2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	1	9	2018-10-19	0.21	6	365	54385	Brooklyn	2648771
1	3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	1	270	2019-07-05	4.64	1	194	17266	Brooklyn	2648771
2	5121	BlissArtsSpace!	7356	Garon	Brooklyn	Bedford-Stuyvesant	40.68688	-73.95596	Private room	60	45	49	2017-10-05	0.40	1	0	0	Brooklyn	2648771
3	5803	Lovely Room 1, Garden, Best Area, Legal rental	9744	Laurie	Brooklyn	South Slope	40.66829	-73.98779	Private room	89	4	167	2019-06-24	1.34	3	314	27946	Brooklyn	2648771
4	6848	Only 2 stops to Manhattan studio	15991	Allen & Irina	Brooklyn	Williamsburg	40.70837	-73.95352	Entire home/apt	140	2	148	2019-06-29	1.20	1	46	6440	Brooklyn	2648771