Introduction

In an effort to make this lab as useful as possible, I am going to change things up a little bit from last week.

  • Going to try to move away from lecture slides, standing in the front of the room with a clicker
  • Be more active; live coding; working with data (that you are using in your projects) in front of you all
  • Sit down and code more rather than talk about it
  • Give more tips and tricks; provide examples and code snippets

As the quarter progresses, I hope to spend more time providing materials, tips/tricks, methods, etc. to help you all finish the projects and understand how to code what we are doing in class.

I will continue to adjust lab to make this time more productive, so any feedback is greatly appreciated.

Furthermore, I am keeping a list of “miscellaneous” topics to go over that I have found super helpful. Let me know if there are any particular topics you all are interested in!

Last week

We talked about:

  • RStudio
  • Projects in RStudio
  • Pipe operators
  • dplyr verbs (ran out of time for most of these)
    • select()
    • filter()
    • arrange()
    • mutate()
    • group_by() + summarize()
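
As a quick refresher, here is a minimal sketch that chains all five verbs together with the pipe. It uses mtcars (built into R) as a stand-in rather than the project data, and the names cars_summary and hp_per_wt are made up for this example.

```r
library(dplyr)

# mtcars ships with R; it stands in for your project data here
cars_summary = mtcars %>%
  select(mpg, cyl, hp, wt) %>%        # keep only these columns
  filter(hp > 100) %>%                # keep rows with more than 100 horsepower
  mutate(hp_per_wt = hp / wt) %>%     # add a new column
  group_by(cyl) %>%                   # one group per cylinder count
  summarize(
    n = n(),
    mean_mpg = mean(mpg),
    mean_hp_per_wt = mean(hp_per_wt)
  ) %>%
  arrange(desc(mean_mpg))             # sort groups from highest to lowest mean mpg

cars_summary
```

Reading the pipe top to bottom as "then" (select, then filter, then mutate, ...) is usually the easiest way to follow it.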

This week I want to apply these topics to an actual project: project-000

Outline

(i.) Setup project-000

  • File management
  • Using here()
  • Writing scripts

(ii.) Question 03 “Get a feel for the data- graphs, summaries, etc”

  • View()
  • skim()
  • clean_names()
  • Using pipes, dplyr and ggplot2

(iii.) Coding errors

  • Erik’s coding error
    • Namespace conflicts
  • How to google for help

Setup project-000

First, open RStudio and close any open projects, then start a new project. I recommend a new RStudio project for each project in this class.

Note: Having a clean and organized file system is extremely important!

Here’s how I would set up my filesystem:

Home > Documents > classes > prediction > projects > project-000

Within project-000 I would create two folders: one for R scripts/markdown called R/ and one for data called data/

I have posted a link to a .zip file of the housing data on the README.md. Download this to your data/ folder and unzip it.
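
Here is a rough sketch of that setup in code, assuming you are working inside the project-000 RStudio project so that here() resolves to the project folder. The zip file name below is a placeholder I made up; swap in whatever the downloaded file is actually called.

```r
library(here)

# here() builds paths relative to the project root (the folder with the .Rproj file)
dir.create(here("R"), showWarnings = FALSE)
dir.create(here("data"), showWarnings = FALSE)

# ASSUMPTION: "housing-data.zip" is a placeholder name, not the real file name
zip_path = here("data", "housing-data.zip")

# Only unzip if the file is actually there
if (file.exists(zip_path)) {
  unzip(zip_path, exdir = here("data"))
}

# See what landed in data/
list.files(here("data"))
```

The nice thing about here() is that the same script works on anyone's computer, since nothing depends on a path like Home > Documents > classes.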

For simplicity, I copied Ed’s data.table() code from Connor Lennon’s RPubs page on Kaggle notebooks into a new R script

Note: Let me know if you guys are interested in learning about data.table!!

“Get a feel for the data”

Question 03 (bullet 2) says “Get a feel for the data- graphs, summaries, etc”

I cannot stress enough how important it is that you do this with each new dataset you come across!

Before you ever run a regression:

  • Plot the data to look for patterns
  • Look at the data set using View(), glimpse(), or skimr::skim()
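
A minimal sketch of what that first look could be, assuming dplyr, janitor, skimr, and ggplot2 are installed. The toy data frame holds made-up values (only the column names mirror the housing data), so swap in your own training data.

```r
library(dplyr)
library(janitor)
library(skimr)
library(ggplot2)

# Toy rows with made-up values; replace with your own training data
toy_df = data.frame(
  Id = 1:6,
  LotArea = c(8000, 9500, 11000, 7200, 14000, 10500),
  SalePrice = c(200000, 185000, 230000, 150000, 260000, 210000)
)

# clean_names() converts column names to snake_case (SalePrice becomes sale_price)
toy_clean = toy_df %>% clean_names()

# View(toy_clean)    # opens the spreadsheet viewer in RStudio
glimpse(toy_clean)   # column types plus the first few values
skim(toy_clean)      # per-column summaries: missing values, mean, sd, quantiles

# Scatter plot to look for a pattern between lot size and price
price_plot = ggplot(toy_clean, aes(x = lot_area, y = sale_price)) +
  geom_point() +
  labs(x = "Lot area", y = "Sale price")

price_plot
```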


Coding errors

First, how is the project going? Is anyone struggling with any errors that they need help with?

Erik came across a very annoying error that I want you all to know about. It took me a minute to figure this one out, and some of you may not have come across it yet.

If you used some of Ed’s code, specifically the predict() function he used, you will come across an error that looks like this

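# Train a model --------------------------------------------------------------------------
# Assumes parsnip (or tidymodels), dplyr, and data.table are loaded, ranger is installed,
# and train_dt / test_dt are data.tables created earlier in the script
# Train the model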
model_trained = rand_forest(mode = "regression", mtry = 12, trees = 10000) %>%
  set_engine("ranger", seed = 12345, num.threads = 10) %>%
  fit_xy(
    y = train_dt[,SalePrice],
    x = train_dt[,-c("Id", "SalePrice")] %>%
      select(MSSubClass:Foundation, KitchenQual, PoolArea:SaleCondition)
  )
# Predict onto testing data
new_predictions = predict(
  model_trained,
  new_data = test_dt
)
# Data to submit
# NOTE: Names and capitalization matter
to_submit = data.frame(
  Id = test_dt$Id,
  SalePrice = new_predictions$.pred
)
# # File to submit
# write_csv(
#   x = to_submit,
#   path = here("data", "to-submit-001.csv")
# )
# Testing performance (broken version, commented out) ------------------------------------
# predict() on an lm object expects `newdata`; `new_data` is the parsnip spelling

model_trained_new = lm(SalePrice ~ LotArea,
                         data = train_dt)

# predictions_testing = predict(
#   object = model_trained_new,
#   new_data = test_dt
# ) %>% as.data.frame()
# 
# test_performance = data.frame(
#   Id = test_dt$Id,
#   SalePrice = predictions_testing$.
# )
# Testing performance (fixed version) ----------------------------------------------------
# Same code, but with `newdata`, the argument name that predict() uses for lm objects

model_trained_new = lm(SalePrice ~ LotArea,
                         data = train_dt)

predictions_testing = predict(
  object = model_trained_new,
  newdata = test_dt
) %>% as.data.frame()

test_performance = data.frame(
  Id = test_dt$Id,
  SalePrice = predictions_testing$.
)
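
The two versions above differ only in new_data versus newdata. parsnip models (like the random forest) take new_data, while base R's predict() for lm takes newdata. Here is a small sketch of what goes wrong, using mtcars (built into R) instead of the housing data; train_df, test_df, wrong, and right are names made up for this example.

```r
# mtcars ships with R; split it so the two pieces have different row counts
train_df = mtcars[1:20, ]
test_df = mtcars[21:32, ]

fit = lm(mpg ~ wt, data = train_df)

# Misspelled argument: predict() for lm does not recognize new_data, so it is
# silently ignored and you get fitted values for the TRAINING rows instead
wrong = predict(fit, new_data = test_df)
length(wrong)   # 20, the number of training rows

# Correct argument name for lm objects
right = predict(fit, newdata = test_df)
length(right)   # 12, the number of testing rows
```

No error shows up at the predict() step. The error only appears later, when data.frame() tries to combine the test Ids with a prediction vector of a different length, which is what makes this one so annoying to track down.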