---
title: "Lab 001"
subtitle: "Data cleaning"
author: "Andrew Dickinson"
date: "`r format(Sys.time(), '%d %B %Y')`"
header-includes:
  - \usepackage{mathtools}
  - \DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
  - \usepackage{amssymb}
output:
  html_document:
    toc: false
    toc_depth: 3
    number_sections: false
    theme: flatly
    highlight: tango
    toc_float:
      collapsed: true
      smooth_scroll: true
---

```{r setup, include=FALSE}
## These next lines set the default behavior for all R chunks in the .Rmd document.
## I recommend you take a look here: https://rmarkdown.rstudio.com/authoring_rcodechunks.html
knitr::opts_chunk$set(echo = TRUE)     ## Show the R code for each chunk
knitr::opts_chunk$set(cache = FALSE)   ## Do not cache results (TRUE would speed up re-knitting)
knitr::opts_chunk$set(warning = FALSE) ## Hide warnings
knitr::opts_chunk$set(message = FALSE) ## Hide messages
```

```{r, echo=FALSE}
pacman::p_load(tidyverse)
```

```{r, echo=FALSE}
# Setup ----------------------------------------------------------------------------------
# Options
options(stringsAsFactors = FALSE)
# Packages
# devtools::install_github("tidymodels/parsnip")
pacman::p_load(
  tidyverse, data.table, lubridate,
  ranger, parsnip,
  magrittr, here
)

# Load data ------------------------------------------------------------------------------
# Training data
train_dt = here('data', 'train.csv') %>% fread()
# Testing data
test_dt = here('data', 'test.csv') %>% fread()

# Data work ------------------------------------------------------------------------------
# Replace "NA" in alley with "No"
train_dt[is.na(Alley), Alley := "No"]
test_dt[is.na(Alley), Alley := "No"]
# Same with fence
train_dt[is.na(Fence), Fence := "No"]
test_dt[is.na(Fence), Fence := "No"]
# and MSZoning
train_dt[is.na(MSZoning), MSZoning := "No"]
test_dt[is.na(MSZoning), MSZoning := "No"]
# and Utilities
train_dt[is.na(Utilities), Utilities := "No"]
test_dt[is.na(Utilities), Utilities := "No"]
# and MiscFeature
train_dt[is.na(MiscFeature), MiscFeature := "No"]
test_dt[is.na(MiscFeature), MiscFeature := "No"]
# and LotFrontage
train_dt[is.na(LotFrontage), LotFrontage := 0]
test_dt[is.na(LotFrontage), LotFrontage := 0]
# and MasVnrArea
train_dt[is.na(MasVnrArea), MasVnrArea := 0]
test_dt[is.na(MasVnrArea), MasVnrArea := 0]
# and MasVnrType
train_dt[is.na(MasVnrType), MasVnrType := "None"]
test_dt[is.na(MasVnrType), MasVnrType := "None"]
# and SaleType
train_dt[is.na(SaleType), SaleType := "?"]
test_dt[is.na(SaleType), SaleType := "?"]
# and Exterior1st
train_dt[is.na(Exterior1st), Exterior1st := "?"]
test_dt[is.na(Exterior1st), Exterior1st := "?"]
# and Exterior2nd
train_dt[is.na(Exterior2nd), Exterior2nd := "?"]
test_dt[is.na(Exterior2nd), Exterior2nd := "?"]
# and KitchenQual
train_dt[is.na(KitchenQual), KitchenQual := "?"]
test_dt[is.na(KitchenQual), KitchenQual := "?"]
# Drop PoolQC
train_dt[, PoolQC := NULL]
test_dt[, PoolQC := NULL]
# Drop FireplaceQu
train_dt[, FireplaceQu := NULL]
test_dt[, FireplaceQu := NULL]
```

# {.tabset .tabset-fade .tabset-pills}

## Introduction

In an effort to make this lab as useful as possible, I am going to change things up a little bit from last week.

- Move away from lecture slides and standing in the front of the room with a clicker
- Be more active: live coding and working with data (that you are using in your projects) in front of you all
- Sit down and code more rather than talk about it
- Give more tips and tricks; provide examples and code snippets

As the quarter progresses, I hope to spend more time providing materials, tips/tricks, methods, etc. to help you all finish the projects and understand how to code what we are doing in class.

I will continue to make adjustments to lab to make this time more productive; any feedback is greatly appreciated. Furthermore, I am keeping a list of "miscellaneous" topics to go over that I have found super helpful. Let me know if there are any particular topics you all are interested in!

#### Last week

We talked about:

- RStudio
- Projects in RStudio
- Pipe operators
- `dplyr` verbs (ran out of time for most of these)
    - `select()`
    - `filter()`
    - `arrange()`
    - `mutate()`
    - `group_by()` + `summarize()`

This week I want to apply these topics with an actual project: __project-000__

## Outline

(i.) Setup project-000

- File management
- Using `here()`
- Writing scripts

(ii.) Question 03 _"Get a feel for the data- graphs, summaries, etc"_

- `View()`
- `skim()`
- `clean_names()`
- Using pipes, `dplyr` and `ggplot2`

(iii.) Coding errors

- Erik's coding error
- Namespace conflicts
- How to google for help

## Setup project-000

__First__ open up RStudio and close any projects; start a new project

- I recommend a new RStudio project for each project in this class

_Note:_ Having a __clean__ and __organized__ file system is extremely important!

Here's how I would set up my filesystem:

`Home > Documents > classes > prediction > projects > project-000`

Within `project-000` I would create two folders: one for R scripts/markdown called `R/` and one for data called `data/`

I have posted a [link](https://github.com/edrubin/EC524W22/raw/master/lab/001-cleaning/data/house-prices-advanced-regression-techniques.zip) to a `.zip` file of the housing data on the `README.md`. Download this to your `data/` folder and unzip it.

For simplicity, I copied Ed's `data.table` code from Connor Lennon's [RPubs page](https://rpubs.com/Clennon/KagNotes) on Kaggle notebooks into a new `R` script.

_Note: Let me know if you guys are interested in learning about `data.table`!!_

## "Get a feel for the data"

Question 03 (bullet 2) says "Get a feel for the data- graphs, summaries, etc"

__I cannot stress enough how important it is that you do this with each new dataset you come across!__

Before you ever run a regression:

- Plot the data to look for patterns
- Look at the data set using `View()`, `glimpse()`, or `skimr::skim()`

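
A minimal sketch of that first look, assuming the `dplyr` and `skimr` packages are installed. The object names `dt_toy` and `na_counts` are made up for illustration, and the built-in `airquality` data stands in for the housing data so the snippet runs on its own:

```r
# Stand-in data: `airquality` ships with base R and has some missing values
dt_toy = airquality

# Compact overview: one row per column, showing its type and first few values
dplyr::glimpse(dt_toy)

# Per-column summary statistics, including how many values are missing
skimr::skim(dt_toy)

# Base R: count missing values in each column, keep only columns with any
na_counts = colSums(is.na(dt_toy))
na_counts[na_counts > 0]
```

The columns that show up in the last line are the ones you need to decide how to handle before modeling, for example by filling them in or dropping them.
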
#### Sidebar: Codebooks

Codebooks are extremely useful for understanding variables and what the heck the variable names mean. Read the codebook always.

- I like to print them out and tape them to my wall

Let's use `dplyr` and `ggplot2` to plot the data using pipe operators!

In-class example of how to use these two packages to analyze different cuts of housing data

_Note: this is really simple visualization with little forethought. Just a showcase of how to use these packages together!_

```{r, fig.width=10, fig.height=6}
# Standardize column names to UpperCamelCase
train_dt_new = train_dt %>% janitor::clean_names('upper_camel')

train_dt_new %>%
  rename(FirstFloorSF = X1StFlrSf, SecondFloorSF = X2NdFlrSf) %>%
  # Total square footage, rounded to the nearest 100
  mutate(TotalSqFt = FirstFloorSF + SecondFloorSF,
         TotalSqFt_binned = round(TotalSqFt, digits = -2)) %>%
  group_by(TotalSqFt_binned, YearBuilt) %>%
  summarize(MeanSalePrice = mean(SalePrice),
            MeanOverallQual = mean(OverallQual)) %>%
  ungroup() %>%
  mutate(PostWWII = ifelse(YearBuilt > 1945, "Post WWII", "Pre WWII")) %>%
  ggplot(aes(x = YearBuilt, y = MeanSalePrice, color = TotalSqFt_binned)) +
  geom_point(alpha = 0.5, size = 3) +
  hrbrthemes::theme_ipsum() +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = 'right'
    # panel.grid.minor = element_blank()
  ) +
  labs(
    x = "Year of construction",
    y = "Mean sales price",
    title = "Ames housing data",
    caption = "Andrew is really cool",
    color = 'Total square feet'
  ) +
  # facet_wrap(~ PostWWII) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = seq(1900, 2010, 50)) +
  scale_color_viridis_c(option = 'magma', begin = 0, end = 0.75) +
  scale_size_continuous(range = c(0.2,3))
```

## Coding errors

First, how is the project going? Is anyone struggling with any errors that they need help with?

Erik came across a very annoying error that I want you all to know about

- Took me a minute to figure this one out
- Some of you may not have come across it

If you used some of Ed's code, specifically the `predict()` function he used, you will come across an error that looks like this:

```{r}
# Train a model --------------------------------------------------------------------------
# Train the model
model_trained = rand_forest(mode = "regression", mtry = 12, trees = 10000) %>%
  set_engine("ranger", seed = 12345, num.threads = 10) %>%
  fit_xy(
    y = train_dt[,SalePrice],
    x = train_dt[,-c("Id", "SalePrice")] %>%
      select(MSSubClass:Foundation, KitchenQual, PoolArea:SaleCondition)
  )
# Predict onto testing data
new_predictions = predict(
  model_trained,
  new_data = test_dt
)
# Data to submit
# NOTE: Names and capitalization matter
to_submit = data.frame(
  Id = test_dt$Id,
  SalePrice = new_predictions$.pred
)
# # File to submit
# write_csv(
#   x = to_submit,
#   file = here("data", "to-submit-001.csv")
# )
```

```{r}
# Train a model --------------------------------------------------------------------------
#- testing performance
model_trained_new = lm(SalePrice ~ LotArea, data = train_dt)

# predictions_testing = predict(
#   object = model_trained_new,
#   new_data = test_dt
# ) %>% as.data.frame()
#
# test_performance = data.frame(
#   Id = test_dt$Id,
#   SalePrice = predictions_testing$.
# )
```

```{r}
# Train a model --------------------------------------------------------------------------
#- testing performance
model_trained_new = lm(SalePrice ~ LotArea, data = train_dt)

predictions_testing = predict(
  object = model_trained_new,
  newdata = test_dt
) %>% as.data.frame()

test_performance = data.frame(
  Id = test_dt$Id,
  SalePrice = predictions_testing$.
)
```
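
The only difference between the last two chunks above is the argument name passed to `predict()`: `new_data` versus `newdata`. Here is a small sketch of why that matters for `lm()` models, using the built-in `mtcars` data instead of the housing data. The names `train_toy`, `test_toy`, `fit_toy`, `wrong`, and `right` are made up for illustration, and this is my reading of the issue rather than a record of the exact error Erik saw:

```r
# Toy split: 20 training rows and 12 held-out rows (mtcars ships with base R)
train_toy = mtcars[1:20, ]
test_toy  = mtcars[21:32, ]

fit_toy = lm(mpg ~ wt, data = train_toy)

# `new_data` is not an argument of predict() for lm objects, so it is
# swallowed by `...` without any warning and we get fitted values for
# the training rows instead of predictions for the test rows
wrong = predict(fit_toy, new_data = test_toy)

# `newdata` is the argument that predict() for lm objects expects
right = predict(fit_toy, newdata = test_toy)

# Compare the lengths against nrow(train_toy) and nrow(test_toy)
length(wrong)
length(right)
```

`parsnip` models use `new_data`, while base R models such as `lm()` use `newdata`, so copying a `predict()` call from one kind of model to the other can quietly return predictions with the wrong number of rows. The mismatch then tends to surface later, for example when `data.frame()` is asked to combine columns of different lengths.
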