EC524W25: Lab 001

Workflow and cleaning

Author

Andrew Dickinson

Autodidacticism (ˌɔː.təʊ.dɪˈdæk.tɪ.sɪ.zəm)
Autodidacticism or self-education (self-learning, self-study, self-teaching) is the practice of acquiring knowledge or skills without formal guidance from teachers, professors, or educational institutions.

Lab goals:

  1. Introduction to Quarto: A short introduction to a new tool for writing documents
  2. Practice using RStudio projects: Create a new project for the lab and load the data
  3. dplyr data manipulation questions: Three practice dplyr problems for review

Quarto is a new document format that is designed to be a more flexible and powerful alternative to R Markdown. Quarto documents are written in markdown and can include code chunks that are executed in a variety of languages such as R and Python.

For the remainder of the quarter, I will be providing you with a template Quarto document with instructions to complete each lab. Code chunks will be left blank for you to fill in, as if this were a normal R script. Before getting started, take a 5 minutes to read through the “Hello, Quarto” tutorial.

Finally, make sure this document compiles properly by clicking the Render button above, or by pressing Cmd/Ctrl + Shift + K.

Before we can start any analysis, we need to load the data in RStudio. There are many ways to do this, but just a little forward thinking on project work flow will save us headaches later.

As discussed last week, we will be using RStudio projects to manage our work. This is a great way to keep organized between separate projects (assignments) and simplify the process of loading data and saving files. To recap, here are the steps to create a new project:

  1. Create a new directory for EC524, somewhere (e.g., Desktop, Documents, iCloud, Dropbox, etc.)
  2. Within this directory, create the following subdirectories:

      EC524W25              # Class folder
      └── lab               # Lab folder
          └── 001-projects  # Separate folder for the lab 000 project
      
  1. Open RStudio
  2. Click on File > New Project… > Existing Directory and navigate to the separate project folder under “lab” > “001-projects”1 and click Create Project. RSudio will open a new session with the working directory set to the project folder
  3. Move the data files and Quarto document into the project folder
  4. Open the doc001.qmd file in the project folder to get started

After completing these steps, here is an visualization of a well organized directory structure:


   EC524W25                                # Class folder  
   └──lab                                  # Lab folder
      ├── 001-projects                     # Project folder
      │   ├── 001-projects.Rproj           # Rproj file (we are here)
      │   ├── data                         # Data folder
      │   │   ├── data_description.txt     
      │   │   ├── sample_submission.csv    
      │   │   ├── test.csv                 
      │   │   └── train.csv                
      │   └── doc001.qmd                   # Quarto document
      └── 002-...                          # Future lab projects
      

Loading the data

Now that we have our project set up, let’s load the data. First, let’s see where we are on our computers

# The following function prints the current working directory
getwd()

You should find the output of the getwd() function to be the path to the project folder. This is our working directory, or . for short. This is where R will look for files when we use relative paths.

Let’s load the data in the code chunk below using the read_csv function. For quick reference for this function type ?read_csv in the console.

# Load train.csv 
train_df = read_csv("data/train.csv")
Tip

Use Tab completion to help you fill in the code. For example, type read_csv("data/ and press Tab to see the files in the data folder.

If the read worked, the following code should print out the first 6 rows in the console:

# Print the first 6 rows of the data
head(train_df)

If this worked, we are ready to move on. If not, double check your work or wait for me to come around and help.

If you have managed to load the “train.csv” data as the train_df object, the following questions will test your knowledge of the dplyr package. There are three questions (easy, medium, and hard), each with a code chunk to fill in your answer.


Solution. Before starting the questions, I would clean things up first:

# Load convenience packages
library(magrittr)
library(janitor)

# Clean names
train_df %<>% clean_names()

# Rename first and second floor columns
train_df %<>%
  rename(first_floor_sqft = x1st_flr_sf,
         second_floor_sqft = x2nd_flr_sf)

Question 01

(easy)

Task: Filter the dataset to only include rows with the two story houses that were built since 2000. Return only the following columns in the output: Id, YearBuilt, HouseStyle. Order the result by year built in descending order.

Solution.

# Your answer here
train_df %>%
  filter(house_style == "2Story" & year_built > 2000) %>%
  select(id, year_built, house_style) %>%
  arrange(desc(year_built))
# A tibble: 143 × 3
      id year_built house_style
   <dbl>      <dbl> <chr>      
 1    88       2009 2Story     
 2   158       2009 2Story     
 3   213       2009 2Story     
 4   461       2009 2Story     
 5   573       2009 2Story     
 6   763       2009 2Story     
 7   915       2009 2Story     
 8   454       2008 2Story     
 9   528       2008 2Story     
10   592       2008 2Story     
# ℹ 133 more rows

Expected output

# A tibble: 143 × 3
|   Id | year_built | house_style |
|------|------------|-------------|
|   88 |       2009 | 2Story      |
|  158 |       2009 | 2Story      |
|  213 |       2009 | 2Story      |
|  461 |       2009 | 2Story      |
|  573 |       2009 | 2Story      |
|  763 |       2009 | 2Story      |


Question 02

(medium)

Task: Create a new column called TotalSF that is the sum of the 1stFlrSF and 2ndFlrSF columns. Filter the dataset to only include rows with a total square footage greater than 3,000. Return only the Id and TotalSF columns in the output.

Solution.

# Your answer here
train_df %>%
  mutate(total_sqft = first_floor_sqft + second_floor_sqft) %>%
  filter(total_sqft > 3000) %>%
  select(id, total_sqft)
# A tibble: 14 × 2
      id total_sqft
   <dbl>      <dbl>
 1   119       3222
 2   186       3036
 3   305       3493
 4   497       3228
 5   524       4676
 6   609       3194
 7   692       4316
 8   770       3279
 9   799       3140
10  1170       3627
11  1183       4476
12  1269       3447
13  1299       5642
14  1354       3238

Expected output

# A tibble: 12 × 2
|   id | total_sqft |
|------|------------|
|  119 |       3222 |
|  186 |       3036 |
|  305 |       3493 |
|  497 |       3228 |
|  524 |       4676 |
|  609 |       3194 |
|  692 |       4316 |
|  770 |       3279 |
|  799 |       3140 |
| 1170 |       3627 |
| 1183 |       4476 |
| 1269 |       3447 |
| 1299 |       5642 |
| 1354 |       3238 |


Question 03

(hard)

Task: From the dataset, find the average YearBuilt for each HouseStyle, but only include HouseStyles where more than 20 houses were built after the year 2000. Sort the resulting data frame by YearBuilt in descending order.

Solution.

# Your answer here
train_df %>%
  filter(year_built > 2000)%>%
  group_by(house_style) %>%
  summarise(average_year_built = mean(year_built), 
            n = n()) %>%
  filter(n > 5) %>%
  arrange(desc(average_year_built))
# A tibble: 3 × 3
  house_style average_year_built     n
  <chr>                    <dbl> <int>
1 1Story                   2005.   211
2 2Story                   2005.   143
3 SLvl                     2004.     6

Expected output

# A tibble: 3 × 3
| house_style | average_year_built |   n   |
|-------------|--------------------|-------|
| 1Story      | 2005.0             |  211  |
| 2Story      | 2005.0             |  143  |
| SLvl        | 2004.0             |    6  |

Footnotes

  1. Note: You can name your folders whatever you like, these are just my naming conventions↩︎