class: center, middle, inverse, title-slide .title[ # Midterm Assignment ] .subtitle[ ## Kaggle Prediction Competition ] .author[ ### Itamar Caspi ] .date[ ### May 18, 2025 (updated: 2025-05-18) ] --- # Kaggle: A Global Hub for Data Science Competitions .pull-left[ - **Kaggle** is a *vibrant data science community* where machine learning practitioners worldwide compete. - Public companies and private users alike upload the datasets used in Kaggle competitions. - A "kaggler" clinches victory by developing the most *accurate* algorithm for a specific dataset. - Kaggle competitions serve as *platforms* for practicing ML skills and keeping abreast of state-of-the-art ML methods. ] .pull-right[ <img src="figs/titanic.jpg" width="50%" style="display: block; margin: auto;" /> ] --- # Getting Started with Kaggle .pull-left[ 1. **Step One:** Visit [www.kaggle.com](www.kaggle.com) and sign-up. 2. **Step Two:** Navigate to the ml4econ course competition (link on Moodle) 3. **Step Three:** Thoroughly review the competition details including objectives, deadlines, data, evaluation criteria, submission rules, and so on. ] .pull-right[ <img src="figs/ml4econ-kaggle.png" style="display: block; margin: auto;" /> ] --- # Understanding the Kaggle Competition Data Structure .pull-left[ - *Immediate Feedback:* The Mean Squared Error (MSE) for the public test set (30%) is available immediately upon submission. - *Delayed Feedback:* The MSE for the private test set (70%) is disclosed only after the competition closes. - *Unpredictable Split:* The division between the public and private test sets is arbitrary and undisclosed to competitors in advance. Remember, your *final ranking* hinges on your performance on the *private* test set. ] .pull-right[ <img src="figs/mse.png" style="display: block; margin: auto;" /> ] --- # Mastering the Basic Kaggle Competition Workflow 1. **Acquire Domain Knowledge:** Build understanding around the problem domain. 2. **Explore the Data:** Perform an initial data analysis. 3. **Preprocessing:** Employ techniques such as standardization, creating dummy variables, determining interactions, etc. 4. **Choose a Model Class:** Decide on a model class, like Lasso, Ridge, Trees, and so forth. 5. **Tune Complexity:** Use cross-validation for optimizing model parameters. 6. **Submit Your Prediction:** Forward your model's prediction for evaluation. 7. **Document Your Workflow:** Keep a well-structured record of your process using *R Markdown*. --- # Monitoring Your Performance .pull-left[ - Leverage the public leaderboard to *track your performance*. - Your interim ranking (reflected in the "scores" column) is determined by your MSE on the public test set. - After the competition closes, the final ranking will hinge on the MSE on the private test set. - While you may submit multiple predictions, exercise caution to avoid overfitting the public test set! ] .pull-right[ <img src="figs/tracking.png" style="display: block; margin: auto;" /> ] --- # Kickstarting Your Kaggle Journey - Link to competition website will be made available on Moodle - Executing this code chunk will automatically download the essential data for our Kaggle competition. This includes train data, test data, and a sample submission file. ``` r train <- read.csv("https://raw.githubusercontent.com/ml4econ/lecture-notes-2025/master/a-kaggle/data/train.csv") test <- read.csv("https://raw.githubusercontent.com/ml4econ/lecture-notes-2025/master/a-kaggle/data/test.csv") sample_submission <- read.csv("https://raw.githubusercontent.com/ml4econ/lecture-notes-2025/master/a-kaggle/data/sample_submission.csv") ``` --- class: .title-slide-final, center, inverse, middle # `slides |> end()` [<i class="fa fa-github"></i> Source code](https://raw.githack.com/ml4econ/notes-spring2025/master/a-kaggle/a-kaggle.html)