class: center, middle, inverse, title-slide

# Midterm Assignment
## Kaggle Prediction Competition
### Itamar Caspi
### April 14, 2019 (updated: 2019-04-15)

---

# What is Kaggle?

.pull-left[

- Kaggle is a huge data science community where machine learning practitioners from around the world compete against each other.

- The datasets used in Kaggle competitions are uploaded by public companies as well as private users.

- A "kaggler" wins if her algorithm is the most accurate on a particular dataset.

- Kaggle competitions are one of the best places to practice your ML skills and learn about state-of-the-art ML methods.

]

.pull-right[

<img src="figs/titanic.jpg" width="50%" style="display: block; margin: auto;" />

]

---

# Introduce yourself to Kaggle

.pull-left[

1. Visit [www.kaggle.com](https://www.kaggle.com) and sign up.

2. Go to the ml4econ course competition webpage: [https://www.kaggle.com/c/55750-machine-learning-for-economists-huji-2019/data](https://www.kaggle.com/c/55750-machine-learning-for-economists-huji-2019/data).

3. Review the competition details: objectives, deadline, data, evaluation, submission rules, etc.

]

.pull-right[

<img src="figs/ml4econ-kaggle.png" width="1265" style="display: block; margin: auto;" />

]

---

# Kaggle competition data structure

.pull-left[

- The MSE on the public test set is available immediately upon submission.

- The MSE on the private test set becomes available only once the competition closes.

- The split between the public and private test sets is arbitrary and unknown in advance. Your final ranking is based on how well you perform on the *private* test set.

]

.pull-right[

<img src="figs/mse.png" width="865" style="display: block; margin: auto;" />

]

---

# The basic Kaggle competition workflow

1. Acquire domain knowledge.

2. Explore the data.

3. Preprocess the data (standardization, dummies, interactions, etc.).

4. Choose a model class (Lasso, trees, NN, etc.).

5. Tune complexity (cross-validation).

6. Submit your prediction.

7. __Document your workflow (R Markdown)__

---

# Tracking your performance

.pull-left[

- Use the public leaderboard to track your performance.

- Your ranking ("scores" column) is based on your MSE on the public test set.

- Once the competition is closed, the final ranking will be based on the MSE on the private test set.

- You can submit multiple predictions, but be careful not to overfit the public test set!

]

.pull-right[

<img src="figs/tracking.png" width="1085" style="display: block; margin: auto;" />

]

---

# Getting started

Running the following code chunk will automatically download the data you need for our Kaggle competition, create a new R project, and open it in a new RStudio session:

```r
# install.packages("usethis")
library(usethis)
use_course("https://github.com/ml4econ/course-spring2019/blob/master/kaggle/kaggle.zip?raw=true")
```

__NOTE:__ By default, the new project will be created on your desktop.

---

class: .title-slide-final, center, inverse, middle

# `slides %>% end()`

[<i class="fa fa-github"></i> Source code](https://raw.githack.com/ml4econ/notes-spring2019/master/a-kaggle/a-kaggle.html)
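
---

# Appendix: how the MSE score is computed

The slides rank submissions by MSE but do not spell out the formula. As a reminder, and assuming the competition uses the standard definition of mean squared error, the score on a test set with $n$ observations can be written as follows (the symbols $n$, $y_i$, and $\hat{y}_i$ are notation chosen for this sketch, not taken from the competition page):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Here $y_i$ is the true outcome for observation $i$ and $\hat{y}_i$ is your submitted prediction. Under this definition, the public and private scores apply the same formula to two disjoint subsets of the test observations, which is why a model tuned to the public score can still rank lower on the private one.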