class: center, middle, inverse, title-slide

.title[
# Introduction to Data Science
]
.subtitle[
## Session 1: What is data science?
]
.author[
### Simon Munzert
]
.institute[
### Hertie School
]

---
class: inverse, center, middle
name: welcome

<style type="text/css">
@media print { /* print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 */
  .has-continuation {
    display: block !important;
  }
}
</style>

# Welcome!

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# Introductions

### Course
https://github.com/intro-to-data-science-23

Much of this course lives on GitHub. You will find lecture materials, code, assignments, and other people's presentations there.

We also have Moodle, which is for everything else.

--

### Me
I'm [Simon Munzert](https://simonmunzert.github.io/) [si’mən munsɜrt], or just Simon [saɪmən].
[munzert@hertie-school.org](mailto:munzert@hertie-school.org)
Professor of Data Science and Public Policy | Director of the Data Science Lab

--

### You

What's your name? And would you share a fun fact about yourself?

---

# More about you

<div align="center">
<br>
<img src="pics/r-knowledge.png" height=500>
</div>

---

# More about you

.pull-left[
<div align="center">
<br><b>MPP/MIA/PhD</b>
<br>
<img src="pics/wordcloud-mpp.png" height=400>
</div>
]

.pull-right[
<div align="center">
<br><b>MDS</b>
<br>
<img src="pics/wordcloud-mds.png" height=400>
</div>
]

---

# More about you

.pull-left[
<div align="center">
<br><b>MPP/MIA/PhD</b>
<br>
<img src="pics/terms_freq_mpp.png" height=400>
</div>
]

.pull-right[
<div align="center">
<br><b>MDS</b>
<br>
<img src="pics/terms_freq_mds.png" height=400>
</div>
]

---

# The labs

.pull-left-wide[
## Who & how

- This course is accompanied by labs administered by **Hiba Ahmad** and **Steve Kerr**.
- The labs are mandatory (MDS) / optional (the rest). Please attend them in any case.
- As with the regular classes, please stick to the lab you are assigned to.

## What for

- What these sessions are meant for:
  - Applying tools in practice
  - Discussion of issues related to the assignments
  - Boosting your R skills
- What these sessions are **not** meant for:
  - Solving the assignments for you
  - Taking care of developing your coding skills
]

.pull-right-small-center[
<div align="center">
<br>
<img src="pics/ahmad-circle.png" height=200>
<br>
<br>
<img src="pics/kerr-circle.png" height=200>
</div>
]

---

# Class etiquette

.pull-left-wide[
<br>

- Learning how to code can be challenging and might lead you out of your comfort zone. If you have problems with the pace of the course, let me and the TAs know. I expect your commitment to the class, but **I do not want anyone to fail.**

- You are all genuinely interested in data science. But there is also considerable variation in your backgrounds. This is how we like it! Some sessions will be more informative for you than others.
If you feel bored, **look out for and help others**, or explore other corners of R you don't know yet.

- The pandemic is still around, and other crises have emerged. We are affected by them in different ways. **Let's support each other.**

- **Be respectful** to each other, all the time. This includes the TAs and me.

- **Ask questions** whenever you feel the need to do so!
]

.pull-right-small-center[
<div align="center">
<br><br>
<img src="pics/stupid-questions.jpg" height=400>
</div>
]

---

# Table of contents

<br><br>

1. [Welcome!](#welcome)
2. [What is data science?](#whatisdatascience)
3. [Sneak preview](#preview)
4. [Class logistics](#logistics)

---
class: inverse, center, middle
name: whatisdatascience

# What is data science?

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-1.png" height=550>
</div>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-2.png" height=550>
</div>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-3.png" height=550>
</div>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-4.png" height=550>
</div>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-5.png" height=550>
</div>

---

# A classic definition of data science

<div align="center">
<img src="pics/venn-orig-5-credit.png" height=550>
</div>

---
background-image: url("pics/vintage-pipeline.jpeg")
background-size: contain
background-color: #000000

# The data science pipeline

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition**
predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-flow.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-wrangle.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
- **Explore**: visualize, describe, discover
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-explore.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
- **Explore**: visualize, describe, discover
- **Model**: build, test, infer, predict
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-model.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
- **Explore**: visualize, describe, discover
- **Model**: build, test, infer, predict

**Dissemination**

- **Communicate**: to the
public, media, policymakers
- **Publish**: journals/proceedings, blogs, software
- **Productize**: make usable, robust, scalable
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-communicate.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
- **Explore**: visualize, describe, discover
- **Model**: build, test, infer, predict

**Dissemination**

- **Communicate**: to the public, media, policymakers
- **Publish**: journals/proceedings, blogs, software
- **Productize**: make usable, robust, scalable

**Meta skill: programming**
]

.pull-right-center[
<br><br><br>
<div align="center">
<img src="pics/data-science-program.png" height=200>
</div>
]

---

# The data science pipeline

.pull-left[
**Preparatory work**

- **Problem definition** predict, infer, describe
- **Design** conceptualize, build data collection device
- **Data collection** recruit, collect, monitor

**Data operation**

- **Wrangle**: import, tidy, manipulate
- **Explore**: visualize, describe, discover
- **Model**: build, test, infer, predict

**Dissemination**

- **Communicate**: to the public, media, policymakers
- **Publish**: journals/proceedings, blogs, software
- **Productize**: make usable, robust, scalable

**Meta skill: programming with R**
]

.pull-right-center[
<br><br>
<div align="center">
<img src="pics/armyknife2.jpg" height=350>
</div>
]

---
class: inverse, center, middle
name: preview

# Sneak preview

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# Introduction to Data Science in a nutshell

<div align="center">
<br>
<img src="pics/ids-syllabus-2023.png" height=500>
</div>

---
class: inverse, center, middle

# Sneak preview

<br>

Learning to love a
programming environment

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
class: center
background-color: #fff

# The tidyverse

<div align="center">
<br>
<img src="pics/hex-tidyverse.gif" height=500>
</div>

---
class: inverse, center, middle

# Sneak preview

<br>

Collecting web data at scale

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# Scraping the web for social research

.pull-left-center[
<div align="center">
<img src="pics/pap1.png" height=300>
<br><br>
<img src="pics/pap4.png" height=200>
</div>
]

.pull-right-center[
<div align="center">
<img src="pics/cld-paper.png" height=250>
<br>
<img src="pics/leakingnetworks.png" height=260>
</div>
]

---
class: inverse, center, middle

# Sneak preview

<br>

Applying data science to tackle social problems

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
class: center
background-color: #fff

# Tracking the usage of a contact tracing app

.pull-left-center[
<div align="center">
<img src="pics/nhb-screenshot.png" height=300>
<br><br>
<img src="pics/corona-warn-app-android-ios.jpeg" height=200>
</div>
]

.pull-right-center[
<div align="center">
<img src="pics/cwa-experiment.png" height=250>
<br>
<img src="pics/cwa-timeline.png" height=260>
</div>
]

---
class: center
background-color: #fff

# Reducing hate speech on social media

.pull-left-center[
<div align="center">
<img src="pics/munger-1.png" height=500>
</div>
]

.pull-right-center[
<div align="center">
<img src="pics/munger-2.png" height=250>
<br>
<img src="pics/munger-3.png" height=260>
</div>
]

---
class: center
background-color: #fff

# Monitoring the effects of climate change on health

.pull-left-center[
<div align="center">
<img src="pics/lancet-2020.png" height=300>
<br><br>
<img src="pics/lancet-2020-fig1.png" height=200>
</div>
]

.pull-right-center[
<div align="center">
<img
src="pics/clickstream_network_graph_health_climate_labels_2020.png" height=250>
<br>
<img src="pics/clickstream-views-2018-2020.png" height=260>
</div>
]

---
class: inverse, center, middle

# Sneak preview

<br>

Getting to know the limits of data

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# Calling bullsh*t when you see it

.pull-left-vsmall[
<br>

**Learn not to be fooled** by

- big data
- garbage data
- garbage models
- weird samples
- claims of generality
- statistical significance
- implausibly large effect sizes
- highly precise forecasts
- overfitted models

And much more...
]

.pull-right-vwide[
<br>
<div align="center">
<img src="pics/sc01.png" height=130>
<img src="pics/gigo.png" height=130>
<br>
<img src="pics/criminal-faces-2.png" height=130>
<img src="pics/calling-bullshit-2.jpg" height=130>
<img src="pics/survivorship-bias.png" height=130>
<br>
<img src="pics/cor-examples7.png" height=130>
<img src="pics/overfitting.png" height=130>
<img src="pics/nyt-opera.png" height=130>
</div>
]

---
class: inverse, center, middle

# Sneak preview

<br>

Reflecting everyday ethics in data science

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---
class: center
background-image: url("pics/clickworker.png")

# How do I pay clickworkers fairly?

---
class: center
background-image: url("pics/webscraping2.jpeg")
background-size: contain

# How do I respect intellectual property?

---
class: center
background-image: url("pics/google-streetview.webp")
background-size: contain
background-color: #000000

# How do I protect the privacy of my research subjects?

---
class: center
background-image: url("pics/hate-online.jpeg")
background-size: contain
background-color: #000000

# How do I protect the safety of my research subjects?
---
class: center
background-image: url("pics/xkcd.png")
background-size: contain
background-color: #000000

# How do I ensure statistical validity, measurement validity, etc.?

---
class: center
background-image: url("pics/versioncontrol.jpeg")
background-size: contain
background-color: #000000

# How do I ensure an open science workflow?

---
class: center
background-image: url("pics/meeting-presentation.png")
background-size: contain
background-color: #000000

# How do I communicate results honestly?

---
class: inverse, center, middle
name: logistics

# Class logistics

<html><div style='float:left'></div><hr color='#EB811B' size=1px style="width:1000px; margin:auto;"/></html>

---

# The plan

.pull-left[
### Goals of the course

- This course equips you with conceptual knowledge about the data science pipeline and coding workflow, data structures, and data wrangling.
- It enables you to apply this knowledge with statistical software.
- It prepares you for our other core courses and electives as well as the master’s thesis.

<div align="center">
<img src="pics/venn-focus.png" height=230>
</div>
]

--

.pull-right[
### What we will cover

- Version control and project management
- R and the tidyverse
- Programming workflow: debugging, automation, packaging
- Relational databases and SQL
- Web data and technologies
- Model fitting and evaluation
- Visualization
- Monitoring and communication
- Data science ethics
- (The command line)
]

---
background-image: url("pics/harry-ron-hermione-start.jpeg")
background-size: contain
background-color: #000000

# You at the beginning of the course

---
background-image: url("pics/harry-ron-hermione-end.jpeg")
background-size: contain
background-color: #000000

# You at the end of the course

---

# Why R and RStudio?

### Data science positivism

- Alongside Python, R has become the *de facto* language for data science.
  - See: [*The Impressive Growth of R*](https://stackoverflow.blog/2017/10/10/impressive-growth-r/), [*The Popularity of Data Science Software*](http://r4stats.com/articles/popularity/)
- Open-source (free!) with a global user-base spanning academia and industry.
  - "Do you want to be a profit source or a cost center?"

--

### Bridge to multiple other programming environments, with statistics at heart

- Already has all of the statistics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs.
- The RStudio IDE and ecosystem allow for further, seamless integration.

--

### Path dependency

- It's also the language that I know best.
- (Learning multiple languages is a good idea, though.)

---

# Why R and RStudio? (cont.)

<img src="01-introduction_files/figure-html/indeeddotcom-1.svg" style="display: block; margin: auto;" />

---

# Why R and RStudio? (cont.)

<div align="center">
<img src="pics/facebook-programming-languages.png" width=600>
</div>

`Credit` [Left_Ad8361/Reddit](https://www.reddit.com/r/dataisbeautiful/comments/qw1bew/oc_which_programming_language_is_required_to_land/)

---

# Core (and optional) readings

<br><br><br>
<div align="center">
<img src="pics/r4ds.jpeg" height=300>
<img src="pics/advr2end.jpeg" height=300>
<img src="pics/rpackages.jpeg" height=300>
<img src="pics/bitbybit.png" height=300>
<img src="pics/datasciencekelleher.jpeg" height=300>
</div>

---

# Attendance

## General rules

- You cannot miss more than two sessions. If you have to miss a session for medical reasons or personal emergencies, please **inform the Examination Office** and they will inform me about your absence. There is no need to notify me in advance or ex post.
- We will check attendance on-site.
- The current **Hertie hygiene rules** apply!

---

# Office hours and advice

- If you want to discuss content from class, please first do so in the lab sessions.
- If you still need more feedback on course topics, use the Moodle forum.
- If you want to discuss any other matters with me, drop Alex Karras, my assistant, a message (
[karras@hertie-school.org](mailto:karras@hertie-school.org)) and he will arrange a meeting.
- For general technical advice, the [Research Consulting Team at the Data Science Lab](https://hertie-data-science-lab.github.io/research-consulting/) is there for you.

---

# Assignments and grading

| Component | Weight |
|:-|-:|
| 4(5) × homework assignments (10% each) | 40% |
| 4(5) × online quizzes (5% each) | 20% |
| 1 × workshop presentation/attendance | 10% |
| 1 × hackathon project | 30% |

--

### Homework assignments

- The assignments are distributed via our own [GitHub Classroom](https://classroom.github.com/classrooms/113288586-introduction-to-data-science-fall-2023).
- Each assignment is a mix of practical problems that are to be solved with R.
- You are encouraged to collaborate, but everyone will hand in a separate solution.
- There will be 5 assignments (one every ~2 weeks; see [overview on GitHub](https://github.com/intro-to-data-science-23/assignments)) and the 4 best will contribute to the final grade.
- You'll have one week to work on each assignment (deadline: Tuesdays at 9:30am).
- You submit your solutions via GitHub.

---

# Assignments and grading

| Component | Weight |
|:-|-:|
| 4(5) × homework assignments (10% each) | 40% |
| 4(5) × online quizzes (5% each) | 20% |
| 1 × workshop presentation/attendance | 10% |
| 1 × hackathon project | 30% |

### Homework assignments

- Grades will be based on (1) the accuracy of your solutions and (2) adherence to a clean and efficient coding style.
- Feedback will be verbal:
  - Excellent (95+)
  - Very good (90-94)
  - Good (85-89)
  - OK (80-84)
  - Acceptable (75-79)
  - Definitely needs improvement (below 75)

---

# Assignments and grading

| Component | Weight |
|:-|-:|
| 4(5) × homework assignments (10% each) | 40% |
| 4(5) × online quizzes (5% each) | 20% |
| 1 × workshop presentation/attendance | 10% |
| 1 × hackathon project | 30% |

### Online quizzes

- The short online quizzes will test your knowledge about the topics covered in class.
- There will be 5 quizzes and the 4 best will contribute to the final grade.
- You'll have one week to work on each quiz (deadline: Tuesdays at 9:30am).

---

# Assignments and grading

| Component | Weight |
|:-|-:|
| 4(5) × homework assignments (10% each) | 40% |
| 4(5) × online quizzes (5% each) | 20% |
| 1 × workshop presentation/attendance | 10% |
| 1 × hackathon project | 30% |

### Workshop presentation (MDS students)

- On October 30, 14-20h, we will flip roles and you will become the instructor of a data science workshop session.
- You, in groups of 2 students, will present a data science workflow tool (randomly [allocated](https://github.com/intro-to-data-science-23/workshop-presentations)).
- Your contribution will include:
  1. A lightning talk (recorded) where you briefly introduce and motivate the tool
  2. A hands-on session where you showcase the tool and provide practice material
- Both the recorded talk and the materials will be graded.
- Check out the materials from previous workshops online [`>2021<`](https://intro-to-data-science-21-workshop.github.io/) [`>2022<`](https://intro-to-data-science-22-workshop.github.io/)!
- **MPP/MIA students**: You will not give a talk, but have to actively participate in the workshop.
---

# Assignments and grading

| Component | Weight |
|:-|-:|
| 4(5) × homework assignments (10% each) | 40% |
| 4(5) × online quizzes (5% each) | 20% |
| 1 × workshop presentation/attendance | 10% |
| 1 × hackathon project | 30% |

### Hackathon project

- On December 4, 17-20h, there will be a hackathon hosted at Hertie.
- At the hackathon itself, we introduce the data, provide an environment that should help you get started with the project, and form groups of 3-4 students.
- Two weeks later, on December 18, the project instructions will be made available. You will then have 48 hours to submit your solutions.
- The task is similar to the homework assignments but puts more emphasis on creative problem-solving using the tools and techniques you have learned in class.

---

# AI use in and for the course

.pull-left-wide[
### Can AI tools (LLM interfaces, AI pair programming) be used for assignments?

- Yes, but use them with care. You will not become an efficient programmer if you heavily rely on those tools without learning the basics.
- The Hertie School introduced [teaching guidelines on the use of AI tools](https://www.hertie-school.org/en/news/allcontent/detail/content/hertie-school-defines-guidelines-for-use-of-artificial-intelligence-software-at-university) in Spring 2023. We will stick to those guidelines.
- Some key elements from the guidelines:
  - "Familiarity with AI tools is helpful for the learning experience and the professional development of students afterwards, ..."
  - "... but needs to be done with clear guidelines on ethical use, biases, and limits of the tools that are currently available."
  - "[T]he use of AI tools for the preparation of assignments (...) needs to be clearly referenced in the text."
]

.pull-right-small-center[
<div align="center">
<br>
<img src="pics/chatgpt-logo.png" height=200>
<br>
<br>
<img src="pics/github-copilot-logo.jpeg" height=200>
</div>
]

---

# Further reading

<div align="center">
<br><br><br>
<img src="pics/bookofwhy.jpg" height=300>
<img src="pics/everybodylies.jpg" height=300>
<img src="pics/everythingisobvious.jpg" height=300>
<img src="pics/calling-bullshit.jpeg" height=300>
<img src="pics/wilkebook.jpeg" height=300>
</div>

---

# Further listening

<div align="center">
<br>
<img src="pics/podcast-hdsr.png" height=200>
<img src="pics/podcast-data-skeptic.png" height=200>
<img src="pics/podcast-data-stories.png" height=200>
<img src="pics/podcast-digital-analytics-power-hour.jpeg" height=200>
<img src="pics/podcast-linear-digressions.png" height=200>
<br>
<img src="pics/podcast-not-so-standard-deviations.png" height=200>
<img src="pics/podcast-oreilly-datashow.jpeg" height=200>
<img src="pics/podcast-banana-data.jpeg" height=200>
<img src="pics/podcast-stats-stories.png" height=200>
<img src="pics/podcast-talking-machines.png" height=200>
</div>

---

# Further watching

<div align="center">
<br><br><br><br><br>
<img src="pics/3blue1brown.jpeg" height=200>
<img src="pics/statsquest.jpeg" height=200>
<img src="pics/dataviz-heiss.png" height=200>
<img src="pics/online-causal-inference.png" height=200>
<img src="pics/civica-data-science.png" height=200>
</div>

---

# Getting started for the course

### Software

1. Download [R](https://cran.rstudio.com/).
2. Download [RStudio](https://www.rstudio.com/products/rstudio/download/preview/).
3. Download [Git](https://git-scm.com/downloads).
4. Create an account on [GitHub](https://github.com/) and register for a student/educator [discount](https://education.github.com/discount_requests/new).
You will soon receive an invitation to the course organization on GitHub, as well as [GitHub Classroom](https://classroom.github.com), which is how we'll disseminate and submit assignments, receive feedback and grading, etc.

--

### OS extras

- **Windows:** Install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). You might also want to install [Chocolatey](https://chocolatey.org/).
- **Mac:** Install [Homebrew](https://brew.sh/).
- **Linux:** None (you should be good to go).

---

# Checklist

☑ Do you have the most recent version of R?

```r
R> version$version.string
```

```
## [1] "R version 4.3.1 (2023-06-16)"
```

☑ Do you have the most recent version of RStudio? (The [preview version](https://www.rstudio.com/products/rstudio/download/preview/) is fine.)

```r
R> RStudio.Version()$version
R> ## Requires an interactive session but should return something like "[1] ‘1.4.1100’"
```

☑ Have you updated all of your R packages?

```r
R> update.packages(ask = FALSE, checkBuilt = TRUE)
```

---

# Checklist (cont.)

Open up the [shell](http://happygitwithr.com/shell.html#shell).

- Windows users, make sure that you installed a Bash-compatible version of the shell. If you installed [Git for Windows](https://gitforwindows.org), then you should be good to go.

☑ Which version of Git have you installed?

```bash
$ git --version
```

```
## git version 2.37.1 (Apple Git-137.1)
```

☑ Did you introduce yourself to Git? (Substitute in your details.)

```bash
$ git config --global user.name 'Simon Munzert'
$ git config --global user.email 'munzert@hertie-school.org'
$ git config --global --list
```

☑ Did you register an account on GitHub?

---

# This semester

<div align="center">
<br>
<img src="pics/wickham-shitty-code.jpeg" height=400>
</div>

---

# Coming up

<br><br><br>

### The first lab session

Get to know Hiba, Steve, R, and RStudio, four of your best friends for the next months!

### Next lecture

Version control and project management