In an effort to make this lab as useful as possible, I am going to change things up a little bit from last week.
As the quarter progresses, I hope to spend more time providing materials, tips/tricks, methods etc. for helping you all finish the projects and understand how to code what we are doing in class.
I will keep adjusting lab to make this time more productive; any feedback is greatly appreciated.
Furthermore, I am keeping a list of “miscellaneous” topics to go over that I have found super helpful. Let me know if there are any particular topics you all are interested in!
Last week we talked about:
dplyr
verbs (ran out of time for most of these)
select()
filter()
arrange()
mutate()
group_by() + summarize()
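As a quick refresher, here is a minimal sketch of each verb. It uses the built-in mtcars data rather than the housing data, and the column choices and object names are just for illustration.

```r
library(dplyr)

# Built-in example data; keep the car names as a regular column
cars <- as_tibble(mtcars, rownames = "model")

# select(): keep only the columns you need
cars_small <- cars %>% select(model, mpg, cyl, wt)

# filter(): keep rows that satisfy a condition
heavy_cars <- cars_small %>% filter(wt > 3.5)

# arrange(): sort rows (here from highest to lowest mpg)
sorted_cars <- cars_small %>% arrange(desc(mpg))

# mutate(): add a new column computed from existing ones
cars_kpl <- cars_small %>% mutate(kpl = mpg * 0.425)

# group_by() + summarize(): one summary row per group
by_cyl <- cars_small %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n())

by_cyl
```

The same pattern should carry over to the housing data once you swap in its column names.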
This week I want to apply these topics to an actual project: project-000
(i.) Setup project-000
here()
(ii.) Question 03 “Get a feel for the data- graphs, summaries, etc”
dplyr and ggplot2
(iii.) Coding errors
First, open RStudio and close any open projects, then start a new project. I recommend a separate RStudio project for each project in this class.
Note: Having a clean and organized file system is extremely important!
Here’s how I would set up my filesystem:
Home > Documents > classes > prediction > projects > project-000
Within project-000, I would create two folders: R/ for R scripts and R Markdown files, and data/ for data.
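If you prefer to create the folders from the console, something like the sketch below should work. The project location here is a temporary directory so the example can run anywhere; in practice you would point project_dir at your own project-000 folder.

```r
# Stand-in location for the project folder (an assumption for this example)
project_dir <- file.path(tempdir(), "project-000")

# Create R/ for scripts and data/ for data; recursive = TRUE also creates
# project_dir itself if it does not exist yet
dir.create(file.path(project_dir, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Check what was created
list.files(project_dir)
```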
I have posted a link to a .zip file of the housing data in the README.md. Download it to your data/ folder and unzip it.
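This is where here() comes in: it builds file paths starting from the project root, so the same code works no matter where the script sits inside the project. A minimal sketch follows; both file names are made up for illustration, so use whatever the download is actually called.

```r
library(here)

# Path to the downloaded archive inside data/ (hypothetical file name)
zip_path <- here("data", "housing-data.zip")

# Unzip into data/ only if the archive is actually there
if (file.exists(zip_path)) {
  unzip(zip_path, exdir = here("data"))
}

# Path you could later pass to a read function (hypothetical file name)
train_path <- here("data", "train.csv")
train_path
```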
For simplicity, I copied Ed's data.table() code from Connor Lennon's RPubs page on Kaggle notebooks into a new R script.
Note: Let me know if you guys are interested in learning about data.table!
Question 03 (bullet 2) says “Get a feel for the data- graphs, summaries, etc”
I cannot stress enough how important it is that you do this with each new dataset you come across!
Before you ever run a regression, look at the data with View(), glimpse(), or skimr::skim().
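Here is a rough sketch of what that first look might be, again on the built-in mtcars data instead of the housing data. View() is left out because it opens an interactive viewer, and the skimr call is wrapped in a check in case the package is not installed.

```r
library(dplyr)
library(ggplot2)

# Column names, types, and the first few values of each column
glimpse(mtcars)

# Quick numeric summary of one variable
summary(mtcars$mpg)

# Fuller summary table, only if skimr is installed
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(mtcars))
}

# A simple histogram to see the shape of a variable
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)
p
```

For the project, the natural first variable to plot would be SalePrice in the training data.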
First, how is the project going? Is anyone struggling with any errors that they need help with?
Erik came across a very annoying error that I want you all to know about. It took me a minute to figure out, and some of you may not have run into it yet.
If you used some of Ed's code, specifically the predict() function he used, you will come across an error that looks like this:
# Train a model --------------------------------------------------------------------------
# Train the model
model_trained = rand_forest(mode = "regression", mtry = 12, trees = 10000) %>%
  set_engine("ranger", seed = 12345, num.threads = 10) %>%
  fit_xy(
    y = train_dt[,SalePrice],
    x = train_dt[,-c("Id", "SalePrice")] %>%
      select(MSSubClass:Foundation, KitchenQual, PoolArea:SaleCondition)
  )
# Predict onto testing data
new_predictions = predict(
  model_trained,
  new_data = test_dt
)
# Data to submit
# NOTE: Names and capitalization matter
to_submit = data.frame(
  Id = test_dt$Id,
  SalePrice = new_predictions$.pred
)
# # File to submit
# write_csv(
#   x = to_submit,
#   path = here("data", "to-submit-001.csv")
# )
# Train a model --------------------------------------------------------------------------
#- testing performance
model_trained_new = lm(SalePrice ~ LotArea,
                       data = train_dt)
# predictions_testing = predict(
#   object = model_trained_new,
#   new_data = test_dt
# ) %>% as.data.frame()
#
# test_performance = data.frame(
#   Id = test_dt$Id,
#   SalePrice = predictions_testing$.
# )
# Train a model --------------------------------------------------------------------------
#- testing performance
model_trained_new = lm(SalePrice ~ LotArea,
                       data = train_dt)
predictions_testing = predict(
  object = model_trained_new,
  newdata = test_dt
) %>% as.data.frame()
test_performance = data.frame(
  Id = test_dt$Id,
  SalePrice = predictions_testing$.
)
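The only difference between the last two versions is the argument name: new_data versus newdata. My understanding of what happens is sketched below on the built-in mtcars data (the train/test split is made up for illustration). predict() for an lm model has no new_data argument, so it is silently ignored and you get predictions for the training rows instead of the test rows.

```r
# Made-up split of a built-in dataset
train <- mtcars[1:20, ]
test  <- mtcars[21:32, ]

fit <- lm(mpg ~ wt, data = train)

# Wrong argument name: `new_data` is swallowed by `...` and ignored,
# so predict() returns fitted values for the 20 training rows
wrong <- predict(fit, new_data = test)
length(wrong)

# Correct argument name for lm: one prediction per test row
right <- predict(fit, newdata = test)
length(right)

# Combining the wrong predictions with the test IDs fails because
# the two columns have different lengths
combined <- tryCatch(
  data.frame(Id = seq_len(nrow(test)), pred = wrong),
  error = function(e) "error"
)
combined
```

The tidymodels version of predict() uses new_data, while base R's predict() for lm uses newdata, so this is easy to trip over when switching between the two.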