# The following function prints the current working directory
getwd()
EC524W25: Lab 001
Workflow and cleaning
Autodidacticism (ˌɔː.təʊ.dɪˈdæk.tɪ.sɪ.zəm)
Autodidacticism or self-education (self-learning, self-study, self-teaching) is the practice of acquiring knowledge or skills without formal guidance from teachers, professors, or educational institutions.
Lab goals:
- Introduction to Quarto: A short introduction to a new tool for writing documents
- Practice using RStudio projects: Create a new project for the lab and load the data
dplyr
data manipulation questions: Three practice dplyr problems for review
Quarto is a new document format that is designed to be a more flexible and powerful alternative to R Markdown. Quarto documents are written in markdown and can include code chunks that are executed in a variety of languages such as R and Python.
For the remainder of the quarter, I will be providing you with a template Quarto document with instructions to complete each lab. Code chunks will be left blank for you to fill in, as if this were a normal R script. Before getting started, take a 5 minutes to read through the “Hello, Quarto” tutorial.
Finally, make sure this document compiles properly by clicking the Render button above, or by pressing Cmd/Ctrl + Shift + K
.
Before we can start any analysis, we need to load the data in RStudio. There are many ways to do this, but just a little forward thinking on project work flow will save us headaches later.
As discussed last week, we will be using RStudio projects to manage our work. This is a great way to keep organized between separate projects (assignments) and simplify the process of loading data and saving files. To recap, here are the steps to create a new project:
- Create a new directory for EC524, somewhere (e.g., Desktop, Documents, iCloud, Dropbox, etc.)
- Within this directory, create the following subdirectories:
EC524W25 # Class folder
└── lab # Lab folder
└── 001-projects # Separate folder for the lab 000 project
- Open RStudio
- Click on File > New Project… > Existing Directory and navigate to the separate project folder under “lab” > “001-projects”1 and click Create Project. RSudio will open a new session with the working directory set to the project folder
- Move the data files and Quarto document into the project folder
- Open the
doc001.qmd
file in the project folder to get started
After completing these steps, here is an visualization of a well organized directory structure:
EC524W25 # Class folder
└──lab # Lab folder
├── 001-projects # Project folder
│ ├── 001-projects.Rproj # Rproj file (we are here)
│ ├── data # Data folder
│ │ ├── data_description.txt
│ │ ├── sample_submission.csv
│ │ ├── test.csv
│ │ └── train.csv
│ └── doc001.qmd # Quarto document
└── 002-... # Future lab projects
Loading the data
Now that we have our project set up, let’s load the data. First, let’s see where we are on our computers
You should find the output of the getwd()
function to be the path to the project folder. This is our working directory, or .
for short. This is where R will look for files when we use relative paths.
Let’s load the data in the code chunk below using the read_csv
function. For quick reference for this function type ?read_csv
in the console.
# Load train.csv
= read_csv("data/train.csv") train_df
Use Tab
completion to help you fill in the code. For example, type read_csv("data/
and press Tab
to see the files in the data folder.
If the read worked, the following code should print out the first 6 rows in the console:
# Print the first 6 rows of the data
head(train_df)
If this worked, we are ready to move on. If not, double check your work or wait for me to come around and help.
If you have managed to load the “train.csv” data as the train_df
object, the following questions will test your knowledge of the dplyr
package. There are three questions (easy, medium, and hard), each with a code chunk to fill in your answer.
Solution. Before starting the questions, I would clean things up first:
# Load convenience packages
library(magrittr)
library(janitor)
# Clean names
%<>% clean_names()
train_df
# Rename first and second floor columns
%<>%
train_df rename(first_floor_sqft = x1st_flr_sf,
second_floor_sqft = x2nd_flr_sf)
Question 01
(easy)
Task: Filter the dataset to only include rows with the two story houses that were built since 2000. Return only the following columns in the output: Id
, YearBuilt
, HouseStyle
. Order the result by year built in descending order.
Solution.
# Your answer here
%>%
train_df filter(house_style == "2Story" & year_built > 2000) %>%
select(id, year_built, house_style) %>%
arrange(desc(year_built))
# A tibble: 143 × 3
id year_built house_style
<dbl> <dbl> <chr>
1 88 2009 2Story
2 158 2009 2Story
3 213 2009 2Story
4 461 2009 2Story
5 573 2009 2Story
6 763 2009 2Story
7 915 2009 2Story
8 454 2008 2Story
9 528 2008 2Story
10 592 2008 2Story
# ℹ 133 more rows
Expected output
# A tibble: 143 × 3
| Id | year_built | house_style |
|------|------------|-------------|
| 88 | 2009 | 2Story |
| 158 | 2009 | 2Story |
| 213 | 2009 | 2Story |
| 461 | 2009 | 2Story |
| 573 | 2009 | 2Story |
| 763 | 2009 | 2Story |
Question 02
(medium)
Task: Create a new column called TotalSF
that is the sum of the 1stFlrSF
and 2ndFlrSF
columns. Filter the dataset to only include rows with a total square footage greater than 3,000. Return only the Id
and TotalSF
columns in the output.
Solution.
# Your answer here
%>%
train_df mutate(total_sqft = first_floor_sqft + second_floor_sqft) %>%
filter(total_sqft > 3000) %>%
select(id, total_sqft)
# A tibble: 14 × 2
id total_sqft
<dbl> <dbl>
1 119 3222
2 186 3036
3 305 3493
4 497 3228
5 524 4676
6 609 3194
7 692 4316
8 770 3279
9 799 3140
10 1170 3627
11 1183 4476
12 1269 3447
13 1299 5642
14 1354 3238
Expected output
# A tibble: 12 × 2
| id | total_sqft |
|------|------------|
| 119 | 3222 |
| 186 | 3036 |
| 305 | 3493 |
| 497 | 3228 |
| 524 | 4676 |
| 609 | 3194 |
| 692 | 4316 |
| 770 | 3279 |
| 799 | 3140 |
| 1170 | 3627 |
| 1183 | 4476 |
| 1269 | 3447 |
| 1299 | 5642 |
| 1354 | 3238 |
Question 03
(hard)
Task: From the dataset, find the average YearBuilt
for each HouseStyle
, but only include HouseStyle
s where more than 20 houses were built after the year 2000. Sort the resulting data frame by YearBuilt
in descending order.
Solution.
# Your answer here
%>%
train_df filter(year_built > 2000)%>%
group_by(house_style) %>%
summarise(average_year_built = mean(year_built),
n = n()) %>%
filter(n > 5) %>%
arrange(desc(average_year_built))
# A tibble: 3 × 3
house_style average_year_built n
<chr> <dbl> <int>
1 1Story 2005. 211
2 2Story 2005. 143
3 SLvl 2004. 6
Expected output
# A tibble: 3 × 3
| house_style | average_year_built | n |
|-------------|--------------------|-------|
| 1Story | 2005.0 | 211 |
| 2Story | 2005.0 | 143 |
| SLvl | 2004.0 | 6 |
Footnotes
Note: You can name your folders whatever you like, these are just my naming conventions↩︎