In this exercise, we will use the ‘Airbnb’ dataset we used in class to continue familiarizing ourselves with regressions and data visualization in R. If you have not already done so, you can download the dataset here.
Let’s start by loading the libraries, setting the working directory, and reading the dataset.
#load libraries
library(data.table)
library(scales)
library(ggplot2)
library(stargazer)
setwd("/Users/dproserp/Library/CloudStorage/Dropbox/teaching/mkt566-2025/w4/airbnb-case")
airbnb = fread("w4-airbnb-case.csv.gz")
head(airbnb)
nrow(airbnb[price==0])
## [1] 0
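The zero-price check above matters because the later models take log(price), and the logarithm of zero is not finite. A minimal sketch with made-up prices (the vector `prices` is hypothetical, not taken from the dataset) shows what would go wrong:

```r
# Hypothetical prices, used only to illustrate the problem with log(0)
prices <- c(0, 50, 100)

log_prices <- log(prices)
log_prices               # the first element is -Inf
is.finite(log_prices)    # only the strictly positive prices give finite logs

# log1p(x) computes log(1 + x), which stays finite when x is zero
log1p(prices)
```

Since the check returns 0 rows, log(price) is defined for every listing and no adjustment of this kind is needed here.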
SOLUTION: Using log(price) instead of price makes the relationship with the number of reviews more linear. Using both log(price) and log(reviews_count) makes the relationship more linear still.
#plot log price vs number of reviews
#(the y-axis is in log units, so dollar labels are not applied)
ggplot(data = airbnb) +
geom_point(mapping = aes(x = reviews_count, y = log(price)), alpha = 0.5) +
labs(title = "Log Price vs Number of Reviews", x = "Number of Reviews", y = "Log Price") +
scale_x_continuous(breaks = seq(0, max(airbnb$reviews_count, na.rm = TRUE), 100)) +
geom_smooth(mapping = aes(x = reviews_count, y = log(price)), method = "lm", se = FALSE, color = "blue") +
theme_minimal()
#plot log-log
#(log(reviews_count) is used in every layer so that the points, the axis breaks,
# and the fitted line share the same x variable as the log-log regression;
# the y-axis is in log units, so dollar labels are not applied)
ggplot(data = airbnb) +
geom_point(mapping = aes(x = log(reviews_count), y = log(price)), alpha = 0.5) +
labs(title = "Log Price vs Log Number of Reviews", x = "Log Number of Reviews", y = "Log Price") +
scale_x_continuous(breaks = seq(0, max(log(airbnb$reviews_count), na.rm = TRUE), 0.5)) +
geom_smooth(mapping = aes(x = log(reviews_count), y = log(price)), method = "lm", se = FALSE, color = "blue") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# level-level
m1 = lm(price ~ reviews_count, data = airbnb)
# log-level
m2 = lm(log(price) ~ reviews_count, data = airbnb)
#log-log
m3 = lm(log(price) ~ log(reviews_count), data = airbnb)
# print estimates with the three models
stargazer(m1, m2, m3, type = "text", title = "Regression of Price on Number of Reviews",
dep.var.labels = c("Price", "log Price", "log Price"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## Regression of Price on Number of Reviews
## =================================================
##                         Dependent variable:
##                    ------------------------------
##                      Price         log Price
##                       (1)        (2)       (3)
## -------------------------------------------------
## reviews_count      -0.347***  -0.001***
##                     (0.032)   (0.0001)
##
## log(reviews_count)                      -0.019***
##                                          (0.003)
##
## Constant           148.041***  4.685***  4.718***
##                     (0.901)    (0.004)   (0.008)
##
## -------------------------------------------------
## Observations         50,836    50,836    50,836
## R2                   0.002     0.0004     0.001
## =================================================
## Note:                 *p<0.1; **p<0.05; ***p<0.01
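SOLUTION: In model 1, the coefficient of reviews_count is the change in price (in dollars) associated with a one-unit increase in reviews_count: each additional review is associated with a price about $0.35 lower. In model 2, the coefficient multiplied by 100 is the approximate percentage change in price for a one-unit increase in reviews_count (about -0.1% per additional review). In model 3, the coefficient is the percentage change in price for a one percent increase in reviews_count, i.e., the elasticity (about -0.019%).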
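As a worked illustration of these three interpretations, the sketch below plugs in the coefficients reported in the regression table above. The variable names (`b_level`, `b_loglevel`, `b_loglog`) are made up for this example, and the 10-review and 10% changes are arbitrary choices.

```r
# Coefficients copied from the regression table above
b_level    <- -0.347   # model 1: price ~ reviews_count
b_loglevel <- -0.001   # model 2: log(price) ~ reviews_count
b_loglog   <- -0.019   # model 3: log(price) ~ log(reviews_count)

# Model 1: dollar change in price associated with 10 additional reviews
change_dollars <- 10 * b_level
change_dollars

# Model 2: approximate vs. exact percent change in price per additional review
pct_approx <- 100 * b_loglevel
pct_exact  <- 100 * (exp(b_loglevel) - 1)
c(approx = pct_approx, exact = pct_exact)

# Model 3: percent change in price associated with a 10% increase in reviews
pct_loglog <- 100 * (1.10^b_loglog - 1)
pct_loglog
```

For coefficients this close to zero, the "multiply by 100" shortcut and the exact formula are nearly identical; the gap grows as the coefficient gets larger in absolute value.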
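SOLUTION: The R-squared values are 0.002, 0.0004, and 0.001. They are very small, suggesting that the number of reviews alone explains almost none of the variation in prices. Note that the R-squared of model 1 is not directly comparable with those of models 2 and 3, because the dependent variable differs (price vs. log price).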
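To make concrete what R-squared measures, here is a small sketch that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with the value reported by `summary()`. The data are simulated for illustration and have nothing to do with the Airbnb dataset.

```r
# Simulated data, used only to illustrate the definition of R-squared
set.seed(1)
x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

ssr <- sum(residuals(fit)^2)     # residual sum of squares
sst <- sum((y - mean(y))^2)      # total sum of squares around the mean
r2_manual <- 1 - ssr / sst

# The manual value should match the one stored in the model summary
all.equal(r2_manual, summary(fit)$r.squared)
```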
SOLUTION: R-squared increases to 0.33, suggesting that the additional variables explain much more of the variation in prices. The coefficients make intuitive sense: higher star ratings, more bathrooms, more bedrooms, and more guests included are all associated with higher prices, holding the other variables constant.
m4 = lm(price ~ reviews_count + star_rating + bathrooms + bedrooms + guests_included, data = airbnb)
stargazer(m4, type = "text",
dep.var.labels = c("Price"),
covariate.labels = c("Number of Reviews", "Star Rating", "Bathrooms", "Bedrooms", "Guests", "Constant"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## =============================================
## Dependent variable:
## ---------------------------
## Price
## ---------------------------------------------
## Number of Reviews -0.180***
## (0.027)
##
## Star Rating 19.366***
## (1.489)
##
## Bathrooms 62.547***
## (1.233)
##
## Bedrooms 57.743***
## (0.937)
##
## Guests 12.309***
## (0.453)
##
## Constant -124.594***
## (7.142)
##
## ---------------------------------------------
## Observations 50,575
## R2 0.329
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
SOLUTION: You get four estimates, one for each city except the baseline. Each coefficient is the difference in price relative to the reference city (Austin, the first one alphabetically), holding the other variables constant. Boston is the most expensive city (about $22 more than Austin) and Miami is the least expensive (about $18 less). The New York City coefficient is not statistically significant.
m5 = lm(price ~ reviews_count + star_rating + bathrooms + bedrooms + guests_included + city, data = airbnb)
stargazer(m5, type = "text",
dep.var.labels = c("Price"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## =============================================
## Dependent variable:
## ---------------------------
## Price
## ---------------------------------------------
## reviews_count -0.192***
## (0.027)
##
## star_rating 19.680***
## (1.494)
##
## bathrooms 64.293***
## (1.235)
##
## bedrooms 56.506***
## (0.944)
##
## guests_included 12.613***
## (0.452)
##
## cityBoston 22.302***
## (2.220)
##
## cityLos Angeles -7.236***
## (1.599)
##
## cityMiami -18.059***
## (2.145)
##
## cityNew York City -10.768
## (32.046)
##
## Constant -123.468***
## (7.418)
##
## ---------------------------------------------
## Observations 50,575
## R2 0.334
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
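Because every city coefficient is measured against the baseline, changing the baseline changes the coefficients but not the fit. The sketch below shows one way to switch the reference level with `relevel()`, using a tiny made-up data frame (`toy`, with invented prices) rather than the Airbnb data.

```r
# Made-up data: two listings per city, prices invented for illustration
toy <- data.frame(
  price = c(100, 120, 150, 170, 90, 95),
  city  = c("Austin", "Austin", "Boston", "Boston", "Miami", "Miami")
)
toy$city <- factor(toy$city)
levels(toy$city)[1]                  # the first level is the default baseline

# Default baseline: coefficients are differences relative to Austin
fit_austin <- lm(price ~ city, data = toy)
coef(fit_austin)

# Switch the baseline to Boston: coefficients are now relative to Boston
toy$city_b <- relevel(toy$city, ref = "Boston")
fit_boston <- lm(price ~ city_b, data = toy)
coef(fit_boston)
```

The Boston-vs-Austin gap is the same in both fits; only its sign and the label it is attached to change.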
SOLUTION: The results are the same as before, confirming that including the city variable as a factor or as individual dummy variables yields the same estimates.
airbnb[, austin := ifelse(city == "Austin", 1, 0)]
airbnb[, boston := ifelse(city == "Boston", 1, 0)]
airbnb[, la := ifelse(city == "Los Angeles", 1, 0)]
airbnb[, miami := ifelse(city == "Miami", 1, 0)]
airbnb[, nyc := ifelse(city == "New York City", 1, 0)]
m5b = lm(price ~ reviews_count + star_rating + bathrooms + bedrooms + guests_included + boston + la + miami + nyc, data = airbnb)
stargazer(m5b, type = "text",
dep.var.labels = c("Price"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## ===========================================
## Dependent variable:
## ---------------------------
## Price
## -------------------------------------------
## reviews_count -0.192***
## (0.027)
##
## star_rating 19.680***
## (1.494)
##
## bathrooms 64.293***
## (1.235)
##
## bedrooms 56.506***
## (0.944)
##
## guests_included 12.613***
## (0.452)
##
## boston 22.302***
## (2.220)
##
## la -7.236***
## (1.599)
##
## miami -18.059***
## (2.145)
##
## nyc -10.768
## (32.046)
##
## Constant -123.468***
## (7.418)
##
## -------------------------------------------
## Observations 50,575
## R2 0.334
## ===========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
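The two sets of estimates match because `lm()` expands a categorical variable into the same 0/1 columns that were built by hand. A short sketch with `model.matrix()` makes that expansion visible. It uses one row per city (names taken from the models above) instead of the full dataset.

```r
# One row per city, just to display the dummy columns R would create
city <- c("Austin", "Boston", "Los Angeles", "Miami", "New York City")
X <- model.matrix(~ city)

colnames(X)   # an intercept plus one dummy per non-baseline city
X             # the Austin row has zeros in every dummy column
```

Austin has no column of its own; it is absorbed into the intercept, which is why the hand-built `austin` dummy is left out of the regression.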
SOLUTION: New York City has the largest city coefficient, but it is not statistically significant. The largest statistically significant coefficient is Boston's, suggesting that Boston is the city generating the highest expected revenue.
airbnb[, revenue := price * reviews_count]
m6 = lm(revenue ~ star_rating + bathrooms + bedrooms + guests_included + city, data = airbnb)
stargazer(m6, type = "text",
dep.var.labels = c("Revenue"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## =============================================
## Dependent variable:
## ---------------------------
## Revenue
## ---------------------------------------------
## star_rating 685.471***
## (38.395)
##
## bathrooms 285.919***
## (31.780)
##
## bedrooms 575.910***
## (24.278)
##
## guests_included 354.705***
## (11.617)
##
## cityBoston 664.073***
## (57.145)
##
## cityLos Angeles 228.071***
## (41.162)
##
## cityMiami -463.726***
## (55.233)
##
## cityNew York City 1,353.149
## (825.198)
##
## Constant -3,008.092***
## (191.009)
##
## ---------------------------------------------
## Observations 50,575
## R2 0.092
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
SOLUTION: The baseline is “Entire home” so the coefficients represent the difference in expected revenue compared to the “Entire home”. “Entire home” generates the most revenue, followed by “Private room”, and “Shared room” generates the least revenue.
m6b = lm(revenue ~ star_rating + bathrooms + bedrooms + guests_included + city + room_type, data = airbnb)
stargazer(m6b, type = "text",
dep.var.labels = c("Revenue"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## =================================================
## Dependent variable:
## ---------------------------
## Revenue
## -------------------------------------------------
## star_rating 629.225***
## (38.203)
##
## bathrooms 383.362***
## (31.882)
##
## bedrooms 471.605***
## (24.259)
##
## guests_included 275.684***
## (11.757)
##
## cityBoston 697.396***
## (56.591)
##
## cityLos Angeles 242.241***
## (40.785)
##
## cityMiami -415.071***
## (54.690)
##
## cityNew York City 1,487.487*
## (816.788)
##
## room_typePrivate room -957.110***
## (35.208)
##
## room_typeShared room -1,671.282***
## (74.877)
##
## Constant -2,231.803***
## (190.719)
##
## -------------------------------------------------
## Observations 50,575
## R2 0.111
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
SOLUTION: The coefficient of highQuality indicates the increase in expected revenue for listings with a star rating of 4.5 or higher compared to those with lower ratings, holding other factors constant. The coefficient suggests that quality, as measured by star rating, has a significant positive impact on revenue.
airbnb[, highQuality := ifelse(star_rating >= 4.5, 1, 0)]
m7 = lm(revenue ~ highQuality + bathrooms + bedrooms + guests_included + city + room_type, data = airbnb)
stargazer(m7, type = "text",
dep.var.labels = c("Revenue"),
omit.stat = c("f", "ser", "adj.rsq"), digits = 3)
##
## =================================================
## Dependent variable:
## ---------------------------
## Revenue
## -------------------------------------------------
## highQuality 904.999***
## (51.761)
##
## bathrooms 384.846***
## (31.872)
##
## bedrooms 473.360***
## (24.249)
##
## guests_included 275.648***
## (11.753)
##
## cityBoston 671.526***
## (56.425)
##
## cityLos Angeles 225.389***
## (40.679)
##
## cityMiami -433.029***
## (54.598)
##
## cityNew York City 1,503.310*
## (816.514)
##
## room_typePrivate room -950.500***
## (35.193)
##
## room_typeShared room -1,680.258***
## (74.766)
##
## Constant -69.590
## (71.567)
##
## -------------------------------------------------
## Observations 50,575
## R2 0.111
## =================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
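To build intuition for what a dummy coefficient such as highQuality captures, the sketch below uses a tiny made-up data frame (`toy`, with invented revenues and ratings). With the dummy as the only regressor, its coefficient equals the raw difference in mean revenue between the two groups.

```r
# Made-up listings, used only to illustrate a dummy-variable regression
toy <- data.frame(
  revenue     = c(1000, 1400, 900, 2500, 2100, 2300),
  star_rating = c(4.0, 3.5, 4.0, 4.5, 5.0, 4.5)
)
toy$highQuality <- as.integer(toy$star_rating >= 4.5)

fit <- lm(revenue ~ highQuality, data = toy)

# Difference in mean revenue between high-quality and other listings
diff_means <- mean(toy$revenue[toy$highQuality == 1]) -
  mean(toy$revenue[toy$highQuality == 0])

coef(fit)["highQuality"]
diff_means
```

In the model above, which also controls for bathrooms, bedrooms, guests, city, and room type, the coefficient is instead an adjusted difference: it compares high-quality and other listings that are alike on those controls.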