import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
For these exercises we use a dataset provided by Airbnb for a Kaggle competition. It describes Airbnb's listings in New York City in 2019, including apartment type, price, location, etc.
Create a DataFrame of a few rows describing objects and their properties (e.g. fruits, their weight and colour). Calculate the mean of your DataFrame.
dict_of_list = {'fruit_name': ["apple", "pear", "watermelon"], 'weight':[100, 94, 95], 'colour':['green', "yellow", "rosa"]}
fruits = pd.DataFrame(dict_of_list)
fruits.describe()
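# describe() calculates common summary statistics
# and only for the columns where they make sense (the numeric ones)
fruits.mean(numeric_only=True)  # restrict the mean to numeric columns; text columns have no mean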
Import the table called AB_NYC_2019.csv as a DataFrame. It is located in the Datasets folder. Have a look at the beginning of the table (head).
Create a histogram of prices.
mydata = pd.read_csv('Datasets/AB_NYC_2019.csv')
mydata.head()  # look at the first rows of the table
plt.style.use('ggplot')
mydata['price'].plot.hist(alpha = 0.5)
plt.show()
# for a more detailed plot, use narrow bins (width 10) over prices below 1000
mydata['price'].plot.hist(alpha = 0.5, bins=range(0,1000,10))
plt.show()
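One detail of explicit bin edges is worth checking: values that fall outside the edges are silently left out of the histogram. The sketch below illustrates this with np.histogram on a handful of made-up prices (not taken from the dataset), using the same edges as range(0,1000,10).

```python
import numpy as np

# Made-up prices for illustration only (not from AB_NYC_2019.csv)
prices = np.array([5, 15, 15, 985, 995, 5000])

# Same edges as range(0, 1000, 10): 0, 10, ..., 990
edges = np.arange(0, 1000, 10)

# counts[i] is the number of values in the bin starting at edges[i];
# values beyond the last edge (here 995 and 5000) are not counted
counts, _ = np.histogram(prices, bins=edges)

print(len(edges), len(counts), counts.sum())
```

If the expensive listings matter for the analysis, extending the edges or plotting on a logarithmic scale would be possible alternatives.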
Create a new column in the dataframe by multiplying the "price" and "availability_365" columns to get an estimate of the maximum yearly income.
mydata['max_yearly_income'] = mydata['price'] * mydata['availability_365']
# NumPy functions can be applied directly to a DataFrame column, e.g.
# np.log(mydata['price'])
# mydata
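As a small sketch of applying a NumPy function to a column: the result is again a pandas Series with the same index and name. The prices below are made up for illustration. np.log1p is used here instead of np.log as a precaution, on the assumption that a price of 0 could occur, where the plain logarithm would give -inf.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration, including a 0
prices = pd.Series([0, 50, 150, 1000], name="price")

# log1p(x) = log(1 + x) is defined at 0, unlike log(x)
log_prices = np.log1p(prices)

print(type(log_prices).__name__, log_prices.name)
```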
Create a new DataFrame by first selecting the rows with a yearly income between 1 and 100,000. Then make a scatter plot of yearly income versus number of reviews.
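# Python's `and` does not work element-wise on Series (it raises a ValueError), so these two attempts fail:
#mydata_sub = mydata[ (mydata['max_yearly_income'] >= 1) and (mydata['max_yearly_income'] <= 100000) ]
#mydata_sub = mydata[ (mydata.max_yearly_income >= 1) and (mydata.max_yearly_income <= 100000) ]
# use the element-wise operator `&` instead, with each condition in parentheses: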
mydata_sub = mydata[ (mydata['max_yearly_income'] >= 1) & (mydata['max_yearly_income'] <= 100000) ].copy()
# mydata[(mydata.max_yearly_income>=1)&(mydata.max_yearly_income <= 100000)].copy()
# mydata_sub
mydata_sub.plot(x = 'number_of_reviews', y = 'max_yearly_income',kind = 'scatter')
mydata_sub['max_yearly_income'].max()
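The range selection can also be written with Series.between, which includes both bounds by default. The sketch below compares it with the `&` form on a made-up column; the values are for illustration only.

```python
import pandas as pd

# Made-up incomes for illustration, covering both boundaries
toy = pd.DataFrame({"max_yearly_income": [0, 1, 500, 100000, 250000]})

# Element-wise combination of two conditions
mask_ops = (toy["max_yearly_income"] >= 1) & (toy["max_yearly_income"] <= 100000)

# Same selection with between(), inclusive on both ends by default
mask_between = toy["max_yearly_income"].between(1, 100000)

# .copy() gives an independent DataFrame that can be modified safely
toy_sub = toy[mask_between].copy()
print(toy_sub)
```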
We provide below an additional table that contains the number of inhabitants of each of New York's boroughs ("neighbourhood_group" in the table). Use merge to add this population information to each element in the original dataframe.
borough_dt = pd.read_excel('Datasets/ny_boroughs.xlsx')
#borough_dt
#mydata
merged_dt = pd.merge(mydata, borough_dt, left_on='neighbourhood_group', right_on='borough', how='left')
#merged_dt
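After a left merge it can be useful to check that every row found a partner. The sketch below shows one way to do this on two made-up tables, using the indicator and validate options of pd.merge. The column name population, all values, and the borough "Atlantis" are hypothetical and only serve to produce an unmatched row.

```python
import pandas as pd

# Made-up listings; "Atlantis" has no entry in the borough table
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group": ["Brooklyn", "Queens", "Atlantis"],
})

# Made-up borough table; "population" is a hypothetical column name
boroughs = pd.DataFrame({
    "borough": ["Brooklyn", "Queens"],
    "population": [10, 20],
})

merged = pd.merge(
    listings, boroughs,
    left_on="neighbourhood_group", right_on="borough",
    how="left",
    validate="many_to_one",  # raises if a borough appears twice in the right table
    indicator=True,          # adds a "_merge" column telling where each row came from
)

# Rows of the left table without a match keep NaN in the right-hand columns
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```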
Use groupby to calculate the average price for each type of room (room_type) in each neighbourhood_group. What is the average price for an entire home in Brooklyn? Then use unstack() and create a bar plot with the resulting table.
merged_dt.groupby(['neighbourhood_group','room_type']).price.mean()
merged_dt.groupby(['neighbourhood_group','room_type']).price.mean()['Brooklyn']['Entire home/apt']
merged_dt.groupby(['neighbourhood_group','room_type'])['price'].mean()['Brooklyn']['Entire home/apt']
unstd_dt = merged_dt.groupby(['neighbourhood_group','room_type']).price.mean().unstack()
unstd_dt.plot(kind = 'bar');
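To see what groupby followed by unstack() produces, here is a minimal sketch on a made-up table with the same column names as the dataset; the prices are invented for illustration. Grouping by two columns gives a Series with a two-level index, and unstack() moves the inner level (room_type) into the columns.

```python
import pandas as pd

# Made-up data with the same column names as the Airbnb table
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"],
    "price": [100, 200, 60, 50],
})

# Series indexed by (neighbourhood_group, room_type)
mean_price = toy.groupby(["neighbourhood_group", "room_type"])["price"].mean()

# A single group can be selected with a tuple of index labels
brooklyn_home = mean_price.loc[("Brooklyn", "Entire home/apt")]

# Wide table: one row per borough, one column per room type;
# combinations that never occur (Queens, Entire home/apt) become NaN
wide = mean_price.unstack()
print(wide)
```

The wide layout is what makes the bar plot group the bars by borough, with one bar per room type.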
Using Seaborn, create a scatter plot where the x and y positions are longitude and latitude, the colour reflects price and the marker shape the borough (neighbourhood_group). Can you recognize parts of New York? Does the map make sense?
fig, ax = plt.subplots(figsize=(10,8))
# style= maps the borough to the marker shape, as asked in the exercise
g = sns.scatterplot(data = merged_dt, y = 'latitude', x = 'longitude', hue = 'price', style = 'neighbourhood_group',
                    hue_norm=(0,200), s=10, palette='inferno')