notes

import marimo

__generated_with = "0.16.0"
app = marimo.App(width="medium")


@app.cell(hide_code=True)
def _():
    import marimo as mo
    import pandas as pd
    return mo, pd


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    # Machine Learning Zoomcamp

    ## Module 1: **Introduction to Machine Learning**
    """
    )
    return


@app.cell(hide_code=True)
def _(pd):
    repository_root = "https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/"

    chapters = pd.DataFrame([
        {
            "title": "Introduction to Machine Learning",
            "youtube_id": "Crm_5n4mvmg",
            "contents": repository_root+"01-intro/01-what-is-ml.md"
        },
        {
            "title": "Machine Learning vs. Rule-Based Systems",
            "youtube_id": "CeukwyUdaz8",
            "contents": repository_root+"01-intro/02-ml-vs-rules.md"
        },
        {
            "title": "Supervised Machine Learning",
            "youtube_id": "j9kcEuGcC2Y",
            "contents": repository_root+"01-intro/03-supervised-ml.md"
        },
        {
            "title": "CRoss-Industry Standard Process for Data Mining",
            "youtube_id": "dCa3JvmJbr0",
            "contents": repository_root+"01-intro/04-crisp-dm.md"
        },
        {
            "title": "Model Selection Process",
            "youtube_id": "OH_R0Sl9neM",
            "contents": repository_root+"01-intro/05-model-selection.md"
        },
        {
            "title": "GitHub Codespaces",
            "youtube_id": "pqQFlV3f9Bo",
            "contents": repository_root+"01-intro/06-environment.md"
        },
        {
            "title": "Introduction to NumPy",
            "youtube_id": "Qa0-jYtRdbY",
            "contents": repository_root+"01-intro/07-numpy.md"
        },
        {
            "title": "Linear Algebra Refresher",
            "youtube_id": "zZyKUeOR4Gg",
            "contents": repository_root+"01-intro/08-linear-algebra.md"
        },
        {
            "title": "Introduction to Pandas",
            "youtube_id": "0j3XK5PsnxA",
            "contents": repository_root+"01-intro/09-pandas.md"
        },
        {
            "title": "Final Summary",
            "youtube_id": "VRrEEVeJ440",
            "contents": repository_root+"01-intro/10-summary.md"
        },
        {
            "title": "Homework",
            "contents": repository_root+"01-intro/homework.md"
        }
    ])

    chapters.insert(loc=0, column="snapshot", value="https://img.youtube.com/vi/"+chapters.youtube_id.astype(str)+"/hqdefault.jpg")
    chapters.insert(loc=2, column="youtube", value="https://youtube.com/watch?v="+chapters.youtube_id.astype(str))

    videos = chapters[chapters["youtube_id"].notnull()]
    videos[["snapshot", "title", "youtube"]]
    return (chapters,)


@app.cell(hide_code=True)
def _(chapters):
    contents = chapters[chapters["contents"].notnull()]
    contents[["title", "contents"]]
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Introduction to Machine Learning

    /// details | Car Prices Dataset
        type: info

    To make these notes more realistic, we'll use Kaggle's [sidharth178/car-prices-dataset](https://www.kaggle.com/datasets/sidharth178/car-prices-dataset) dataset.
    ///
    """
    )
    return


@app.cell(hide_code=True)
def _(pd):
    car_prices = pd.read_csv("./module-1/data/car-prices/train.csv")
    car_prices.head()
    return (car_prices,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    To get a first idea of what Machine Learning is, we can imagine that we own a website where people can sell their used cars. A first problem in which we could use Machine Learning is assisting users when they set a price for their car.

    In this problem we'd start with the car's **features**. For instance:
    """
    )
    return


@app.cell(hide_code=True)
def _(car_prices):
    [column for column in car_prices.columns if column not in ("ID", "Price")]
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ... and we'll try to guess its price. Typically, we call the variable that we are trying to guess the **target**.

    Given a list of cars with a certain set of features and their corresponding prices, we can:

    * First, train a model so that it learns to relate **features** to prices.
    * Then, use the model so that, given a set of features, it guesses the **target**; in our case, the car's price.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Machine Learning vs. Rule-Based Systems

    In this chapter we compare the "classic" way of creating programs with the Machine Learning approach. As an example, we imagine how we could create a program that works as a spam detector, receiving an email and classifying it as spam or not spam.

    ```python
    class Email:
        from_email_address: str
        to_email_addresses: list[str]
        cc_email_addresses: list[str]
        subject: str
        message: str
    ```
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Rule Based

    To create a rule-based spam filter we would write a program that checks for different things, which we would usually add one by one after examining the previous emails that we've been able to manually classify as either legitimate or spam:

    ```python
    def has_suspicious_from(email: Email) -> bool:
        suspicious_literals = ["spam", "mailinator"]
        return any([
            email.from_email_address.find(literal) > -1
            for literal in suspicious_literals
        ])

    def is_spam(email: Email) -> bool:
        return has_suspicious_from(email)
    ```

    This looks great: it can detect some spam messages. But we'll very likely have to adapt the code as we receive more emails and find that it needs more complex logic, for instance, checking whether the email is addressed or copied to too many people.

    ```python
    def has_suspicious_targets(email: Email) -> bool:
        return len(email.to_email_addresses) + len(email.cc_email_addresses) > 5

    def is_spam(email: Email) -> bool:
        return has_suspicious_from(email) or has_suspicious_targets(email)
    ```

    The more criteria we want to take into account, the more code we'll have and the more complex it will become.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Machine Learning

    To create a machine learning model that solves the same issue, we follow a simpler process:

    1. Get the data
    2. Define its features and create a dataset linking them to a target variable (the spam flag)
    3. Train a classifier model on the new dataset
    4. Use the trained model to check new emails

    Thanks to this approach we can add a **spam** button that users click when they see a spam message, and use those clicks as our source of information to iteratively train our model on new spam and legitimate emails.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### From Rule Based Systems to Machine Learning

    When migrating from rule-based systems to machine learning approaches we don't have to throw everything away. Instead, we can use many of the initial rules as the basis for the features of our model.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Supervised Machine Learning

    * On one hand, as the **features** are in most cases many numeric variables per item, we'll use an upper case $X$ to represent their matrix nature.

    * On the other hand, as the **target** variable can be represented by a single number per item, we can use a vector $y$ to represent the answers that correspond to a given input matrix.

    Training a model that can predict our target variable given a list of features for a batch of items can be seen as creating a function $g$ such that:

    \[
    g(X) = y
    \]

    Before training a model, we'd transform all features into numerical representations.

    This is our $X$ matrix:
    """
    )
    return


@app.cell(hide_code=True)
def _(car_prices, pd):
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import GammaRegressor
    from sklearn.preprocessing import StandardScaler

    def first_pass(original):
        passed = original.copy()
        passed["Levy"] = pd.to_numeric(passed["Levy"].replace("-", 0), errors="coerce")
        passed["Mileage"] = passed["Mileage"].str.replace(" km", "", regex=False).astype(float)
        passed["Engine volume"] = pd.to_numeric(passed["Engine volume"], errors="coerce")
        passed["Engine volume"] = passed["Engine volume"].fillna(passed["Engine volume"].median())
        return passed

    car_train = first_pass(car_prices)

    X_train = car_train.drop(columns=["ID", "Price"])
    y_train = car_train["Price"]

    numerical_features = X_train.select_dtypes(include=["int64", "float64"]).columns
    categorical_features = X_train.select_dtypes(include=["object"]).columns

    preprocessor = ColumnTransformer([
        ("num", "passthrough", numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ])

    preprocessor.fit(X_train)

    X_train_transformed = preprocessor.transform(X_train)

    feature_names = preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_features)
    all_feature_names = list(numerical_features) + list(feature_names)

    X_train_preprocessed = pd.DataFrame(
        X_train_transformed.toarray() if hasattr(X_train_transformed, "toarray") else X_train_transformed,
        columns=all_feature_names
    )

    X_train_preprocessed.head()
    return (
        GammaRegressor,
        Pipeline,
        StandardScaler,
        X_train,
        first_pass,
        preprocessor,
        y_train,
    )


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""Finally, we can take a dataset for which we have the features but not the price and use our trained model to compute a **Predicted Price**.""")
    return


@app.cell(hide_code=True)
def _(
    GammaRegressor,
    Pipeline,
    StandardScaler,
    X_train,
    first_pass,
    pd,
    preprocessor,
    y_train,
):
    car_test = pd.read_csv("./module-1/data/car-prices/test.csv")
    car_test = first_pass(car_test)

    X_test = car_test.drop(columns=["ID", "Price"])

    model = Pipeline([
        ("preprocess", preprocessor),
        ("scaler", StandardScaler(with_mean=False)),
        ("regressor", GammaRegressor(max_iter=1000))
    ])

    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)

    X_test.insert(1, "Predicted Price", value=y_predicted.astype(int))
    X_test
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## CRoss-Industry Standard Process for Data Mining

    ### The Problem

    Going back to the spam detection example, what we did was:

    * we defined our goal (to detect whether a message is spam or not)
    * we extracted some features
    * we trained a model
    * we used the model with test data to evaluate it

    These steps are a basic representation of what the "CRISP-DM" methodology describes.
    """
    )
    return


@app.cell
def _(mo):
    mo.image("./module-1/assets/crisp-dm.jpeg")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Business Understanding

    In a real-world case, many departments of a big organization participate in a data mining problem. The first step consists of understanding the problem that we are trying to solve, not from a technical point of view but from a business point of view. In fact, we shouldn't decide whether to start a Machine Learning project until we really understand the business problem we are facing.

    At this step, the most important thing is to establish measurable goals.

    /// details | **Example Goal for the Spam Detection Problem**
        type: info

    We want to reduce the number of spam messages by 50%.
    ///
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Data Understanding

    Once we understand the problem that we are trying to solve, we have to gather the data that we have available and do our best to understand it. At this step, we have to ask ourselves a few questions:

    * Where does the data come from?
    * Is it reliable?
    * Is the dataset big enough?
    * Can we collect more?

    /// details | **Example Questions for the Spam Detection Problem**
        type: info

    Can we ask our users to mark incoming messages as spam (or not spam)?
    ///
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Data Preparation

    Preparing the data so that it can be fed into a Machine Learning algorithm involves several steps:

    * Extract some features from raw data
    * Remove duplicated records
    * Transform the data into numeric values
    * Create different splits for training and validation

    /// details | **Example Preparations for the Spam Detection Problem**
        type: info

    Does the subject contain more than 25 characters?
    Does the sender contain "mailinator"?
    ///
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Modeling

    At this step we choose different models, train them with our training dataset split and choose the best one according to some metrics.

    /// details | **Example Models for the Spam Detection Problem**
        type: info

    In the spam detection case we could choose between logistic regression, decision trees, neural networks, etc.
    ///
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Evaluation

    Here we check whether we managed to reach the goals that we established during the first step.

    * Was the goal achievable?
    * Did we reach it?
    * What can we do to get closer in the next iteration?
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Deployment

    Finally, we deploy our new models so that they are accessible to our end users.

    * Deployment is usually tied to evaluation because there is no better evaluation than that of end users.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ## Model Selection Process

    In this chapter, we focus on the **modeling** part that we described above.

    ### Train and Validation Splits

    The first and probably most important technique used to create models is to split our dataset into two different parts:

    * Around 80% of the dataset will become our **train** dataset
    * We'll keep the remaining part hidden from the model for **validation** purposes

    That will let us use the model to create predictions for cases that it has not seen during its training. As we have the correct answers for those cases, we'll be able to measure the differences between the correct answers and the answers generated by the model.
    """
    )
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
    ### Multiple Comparison Problem

    When testing many models or hyperparameter settings, the chance of finding a good result just by luck increases with the number of models and settings that we consider. So we may find models that perform well on one dataset but fail to generalize.

    A technique that helps with this is to split our dataset into three (not two) parts:

    * **Training** set: Used to fit the model parameters.
    * **Validation** set: Used to tune hyperparameters and compare models (avoids overfitting directly to the training set).
    * **Test** set: Held out until the very end to measure the true generalization performance.
    """
    )
    return


if __name__ == "__main__":
    app.run()