--- title: "Introduction to Data Science" subtitle: "Session 1: What is data science?" author: "Simon Munzert" institute: "Hertie School | [GRAD-C11/E1339](https://github.com/intro-to-data-science-21)" #"`r format(Sys.time(), '%d %B %Y')`" output: xaringan::moon_reader: css: [default,'simons-touch.css', metropolis, metropolis-fonts] lib_dir: libs nature: highlightStyle: github highlightLines: true countIncrementalSlides: false ratio: '16:9' hash: true --- class: inverse, center, middle name: welcome ```{css, echo=FALSE} @media print { # print out incremental slides; see https://stackoverflow.com/questions/56373198/get-xaringan-incremental-animations-to-print-to-pdf/56374619#56374619 .has-continuation { display: block !important; } } ``` ```{r setup, include=FALSE} # figures formatting setup options(htmltools.dir.version = FALSE) library(knitr) opts_chunk$set( prompt = T, fig.align="center", #fig.width=6, fig.height=4.5, # out.width="748px", #out.length="520.75px", dpi=300, #fig.path='Figs/', cache=T, #echo=F, warning=F, message=F engine.opts = list(bash = "-l") ) ## Next hook based on this SO answer: https://stackoverflow.com/a/39025054 knit_hooks$set( prompt = function(before, options, envir) { options( prompt = if (options$engine %in% c('sh','bash')) '$ ' else 'R> ', continue = if (options$engine %in% c('sh','bash')) '$ ' else '+ ' ) }) library(tidyverse) library(hrbrthemes) library(fontawesome) ``` # Welcome!

--- # Introductions ### Course `r fa('globe')` https://github.com/intro-to-data-science-21 Much of this course lives on GitHub. You will find lecture materials, code, assignments, and other people's presentations there. We also have Moodle, which is is for everything else. -- ### Me `r fa('address-book')` I'm [Simon Munzert](https://simonmunzert.github.io/) [si’mən munsɜrt], or just Simon [saɪmən]. `r fa('envelope')` [munzert@hertie-school.org](mailto:munzert@hertie-school.org) `r fa('graduation-cap')` Assistant Professor (data science and public policy) -- ### You What's your name? Why are you here? And would you share a funny fact about yourself? --- # More about you

--- # More about you .pull-left[

MPP/MIA/PhD

] .pull-right[

MDS

] --- # More about you .pull-left[

MPP/MIA/PhD

] .pull-right[

MDS

] --- # The labs .pull-left-wide[ ## Who & how - This course is accompanied by labs administered by **Lisa Oswald** and **Tom Arend**. - The labs are mandatory. Please attend them. - As with the regular classes, please stick to the lab you are assigned to. ## What for - What these sessions are meant for: - Discussion of issues related to the content covered in the lecture - Discussion of issues related to the assignments - Boosting your R skills - What these sessions are **not** meant for: - Solving the assignments for you ] .pull-right-small-center[

] --- # Class etiquette .pull-left-wide[ - Learning programming with R can be challenging and might lead you out of your comfort zone. If you have problems with the pace of the course, let me know. I expect your commitment to the class, but **I do not want anyone to fail.** - You are all genuinely interested in data science. But there is also considerable variation in your backgrounds. This is how we like it! Some sessions will be more informative for you than others. If you feel bored, **look out for and help others**, or explore other corners of R you don't know yet. - The pandemic is still around. We are differentially affected by it. We are located in different time zones. **Let's support each other.** - **Be respectful** to each other, all the time. - **Ask questions** whenever you feel the need to do so! ] .pull-right-small-center[

] --- # Table of contents

1. [Welcome!](#welcome) 2. [What is data science?](#whatisdatascience) 3. [Sneak preview](#preview) 4. [Class logistics](#logistics) --- class: inverse, center, middle name: whatisdatascience # What is data science?

--- # An old classic

--- background-image: url("pics/vintage-pipeline.jpeg") background-size: contain background-color: #000000 # The data science pipeline --- # The data science pipeline .pull-left[ **Preparatory work** - **Problem definition** predict, infer, describe - **Design** conceptualize, build data collection device - **Data collection** recruit, collect, monitor ] --- # The data science pipeline .pull-left[ **Preparatory work** - **Problem definition** predict, infer, describe - **Design** conceptualize, build data collection device - **Data collection** recruit, collect, monitor **Data operation** ] .pull-right-center[

] --- # The data science pipeline .pull-left[ **Preparatory work** - **Problem definition** predict, infer, describe - **Design** conceptualize, build data collection device - **Data collection** recruit, collect, monitor **Data operation** - **Wrangle**: import, tidy, manipulate - **Explore**: visualize, describe, discover - **Model**: build, test, infer, predict **Dissemination** - **Communicate**: to the public, media, policymakers - **Publish**: journals/proceedings, blogs, software - **Productize**: make usable, robust, scalable ] .pull-right-center[

] --- class: inverse, center, middle name: preview # Sneak preview

--- class: inverse, center, middle # Sneak preview
Learning to love a programming environment

--- class: center background-color: #fff # The tidyverse

--- class: inverse, center, middle # Sneak preview
Collecting web data at scale

--- class: center background-color: #fff # Most popular websites 2019

--- # Scraping the web for social research .pull-left-center[

] .pull-right-center[

] --- class: inverse, center, middle # Sneak preview
Applying data science to tackle social problems

--- class: center background-color: #fff # Tracking the usage of a contact tracing app .pull-left-center[

] .pull-right-center[

] --- class: center background-color: #fff # Reducing hate speech on social media .pull-left-center[

] .pull-right-center[

] --- class: center background-color: #fff # Monitoring the effects of climate change on health .pull-left-center[

] .pull-right-center[

] --- class: inverse, center, middle # Sneak preview
Getting to know the limits of data

--- # Calling bullsh*t when you see it .pull-left-small[
Learn not to be fooled by - big data - garbage data - garbage models - weird samples - claims of generality - statistical significance - implausibly large effect sizes - highly precise forecasts - overfitted models And much more... ] .pull-right-wide-center[

] --- class: inverse, center, middle # Sneak preview
Reflecting everyday ethics in data science

--- class: center background-image: url("pics/clickworker.png") # How do I pay clickworkers fairly? --- class: center background-image: url("pics/webscraping2.jpeg") background-size: contain # How do I respect intellectual property? --- class: center background-image: url("pics/google-streetview.webp") background-size: contain background-color: #000000 # How do I protect the privacy of my research subjects? --- class: center background-image: url("pics/hate-online.jpeg") background-size: contain background-color: #000000 # How do I protect the safety of my research subjects? --- class: center background-image: url("pics/xkcd.png") background-size: contain background-color: #000000 # How do I ensure statistical, measurement validity, etc.? --- class: center background-image: url("pics/versioncontrol.jpeg") background-size: contain background-color: #000000 # How do I ensure an open science workflow? --- class: center background-image: url("pics/meeting-presentation.png") background-size: contain background-color: #000000 # How do I communicate results honestly? --- class: inverse, center, middle name: logistics # Class logistics

--- # The plan .pull-left[ ### Goals of the course - This course equips you with conceptual knowledge about the data science pipeline and coding workflow, data structures, and data wrangling. - It enables you to apply this knowledge with statistical software. - It prepares you for our other core courses and electives as well as the master’s thesis.

] -- .pull-right[ ### What we will cover - Version control and project management - R and the tidyverse - Relational databases and SQL - Web data and technologies - Model fitting and simulation - Viusalization - The command line - Programming workflow: debugging, automation, packaging - Monitoring and communication - Data science ethics ] ] --- background-image: url("pics/harry-ron-hermione-start.jpeg") background-size: contain background-color: #000000 # You at the beginning of the course --- background-image: url("pics/harry-ron-hermione-end.jpeg") background-size: contain background-color: #000000 # You at the end of the course --- # Why R and RStudio? ### Data science positivism - Alongside Python, R has become the *de facto* language for data science. - See: [*The Impressive Growth of R*](https://stackoverflow.blog/2017/10/10/impressive-growth-r/), [*The Popularity of Data Science Software*](http://r4stats.com/articles/popularity/) - Open-source (free!) with a global user-base spanning academia and industry. - "Do you want to be a profit source or a cost center?" -- ### Bridge to multiple other programming environments, with statistics at heart - Already has all of the statistics support, and is amazingly adaptable as a “glue” language to other programming languages and APIs. - The RStudio IDE and ecosystem allow for further, seemless integration. -- ### Path dependency - It's also the language that I know best. - (Learning multiple languages is a good idea, though.) --- # Why R and RStudio? (cont.) ```{R, indeeddotcom, echo = F, fig.height = 6, fig.width = 9, dev = "svg", warning = F} # The popularity data pop_df = data.frame( lang = c("Python", "SQL", "R", "SAS", "Matlab", "SPSS", "Stata"), n_jobs = c(8150, 7264, 3484, 950, 687, 270, 74), #n_jobs = c(99756, 132860, 41242, 31943, NA, 4843, 2302), free = c(T, T, T, F, F, F, F) ) ## Plot it pop_df %>% mutate(lang = lang %>% factor(ordered = T)) %>% ggplot(aes(x = lang, y = n_jobs, fill = free)) + geom_col() + geom_hline(yintercept = 0) + aes(x = reorder(lang, -n_jobs), fill = reorder(free, -free)) + xlab("Statistical language") + scale_y_continuous(label = scales::comma) + ylab("Number of jobs") + labs( title = "Comparing statistical languages", subtitle = "Number of job postings for ' data' on de.indeed.com, 2021/08/27" ) + scale_fill_manual( "Free?", labels = c("True", "False"), values = c("#f92672", "darkslategray") ) + ggthemes::theme_pander(base_size = 17) + # theme_ipsum() + theme(legend.position = "bottom") ``` --- # Why R and RStudio? (cont.)

˚Credit` [Left_Ad8361/Reddit](https://www.reddit.com/r/dataisbeautiful/comments/qw1bew/oc_which_programming_language_is_required_to_land/) --- # Core (and optional) readings

--- # Attendance ## General rules - Switching on-site groups will not be possible during the course of the semester. - However, if you happen to be unavailable on your slot but not on the other, please tune in (**only online!**). Also, please let Ayamba know (see below). - You cannot miss more than two sessions. If you have to miss a session for medical reasons or personal emergencies, please inform Examination Office. - We will check attendance on-site + online. If you attend online, type your full name in the chat when you log in. - For on-site attendance, the current Hertie hygiene rules apply! -- ## On-site rotation scheme - There is more interest in attending the course on-site than we have space. - I have set up a rotation scheme that tries to balance on-site attendance across participants, within groups. Check it out on Moodle. - If you are not listed as on-site participant for a particular session, please do not sneak into the on-site session. - **If you plan to attend online instead of on-site on short notice**, please drop Ayamba ([kwoyila@hertie-school.org](mailto:kwoyila@hertie-school.org)) a line until Wednesday noon so that she can inform successors. --- # Office hours and advice - If you want to discuss content from class, please first do so in the lab sessions. - If you still need more feedback, get in touch with me (`r fa('envelope')` [munzert@hertie-school.org](mailto:munzert@hertie-school.org)). - If you want to discuss any other matters with me, just drop me a message and we'll arrange a meeting. - My (virtual) office hours (by appointment) are Tuesdays, 3-4pm CET. - For general technical advice, the [Research Consulting Team at the Data Science Lab](https://hertie-data-science-lab.github.io/research-consulting/) is there for you. --- # Assignments and grading | Component | Weight | |:-|-:| | 4 × homework assignments (10% each) | 40% | | 1 × workshop presentation | 25% | | 1 × final data science project | 35% | -- ### Homework assignments - The assignments are distributed via our own [GitHub Classroom](https://classroom.github.com/classrooms/88543430-introduction-to-data-science). - Each assignment is a mix of practical problems that are to be solved with R. - You are encouraged to collaborate, but everyone will hand in a separate solution. - There will be 5 assignments (one every ~2 weeks) and the 4 best will contribute to the final grade. - You'll have roughly 2 weeks to work on each assignment. - You submit your solutions via GitHub Classroom. - Grades will be based on (1) the accuracy of your solutions and (2) the adherence of a clean and efficient coding style that you will learn in the first sessions. --- # Assignments and grading | Component | Weight | |:-|-:| | 4 × homework assignments (10% each) | 40% | | 1 × workshop presentation | 25% | | 1 × final data science project | 35% | ### Workshop presentation - About halfway through the semester, we will flip the roles and you will become the instructor for one session of a workshop on **Tools for Data Science**. Whether this will happen onsite (hybrid) or online will still be determined. - You, in (randomly allocated) groups of 2 students, will present a preexisting tool or package that is useful for the data science workflow. - Topics are listed on [GitHub](https://github.com/intro-to-data-science-21/workshop-presentations) and will be randomly allocated. - Your contribution will include 1. A lightning talk (recorded) where you briefly introduce and motivate the tool 2. A hands-on session where you showcase the tool and provide practice material - Both the talk and the prepared exercise materials, which must also be hosted openly on GitHub, will be graded. --- # Assignments and grading | Component | Weight | |:-|-:| | 4 × homework assignments (10% each) | 40% | | 1 × workshop presentation | 25% | | 1 × final data science project | 35% | ### Final data science project - In the component, to be submitted a couple of weeks after classes have finished, you will design and implement your own data science project. - You are supposed to collaborate in groups of two or three students. - Student groups choose their topic subject to approval by the instructor. - A wide variety of project types are imaginable and include 1. A report about a statistical data analysis 2. A policy explainer project enriched with data (think: a piece of data journalism) 3. A dashboard to make data / analyses accessible in an interactive fashion 4. An R package with a dedicated use case (e.g., API binding, visualization tools, etc.) --- # Further reading

--- # Further listening

--- # Further watching

--- # Getting started for the course ### Software 1. Download [R](https://www.r-project.org/). 2. Download [RStudio](https://www.rstudio.com/products/rstudio/download/preview/). 3. Download [Git](https://git-scm.com/downloads). 4. Create an account on [GitHub](https://github.com/) and register for a student/educator [discount](https://education.github.com/discount_requests/new). You will soon receive an invitation to the course organization on GitHub, as well as [GitHub classroom](https://classroom.github.com), which is how we'll disseminate and submit assignments, receive feedback and grading, etc. -- ### OS extras - **Windows:** Install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). You might also want to install [Chocolately](https://chocolatey.org/). - **Mac:** Install [Homebrew](https://brew.sh/). - **Linux:** None (you should be good to go). --- # Checklist ☑ Do you have the most recent version of R? ```{r, cache=FALSE} version$version.string ``` ☑ Do you have the most recent version of RStudio? (The [preview version](https://www.rstudio.com/products/rstudio/download/preview/) is fine.) ```{r eval=FALSE} RStudio.Version()$version ## Requires an interactive session but should return something like "[1] ‘1.4.1100’" ``` ☑ Have you updated all of your R packages? ```{r eval=FALSE} update.packages(ask = FALSE, checkBuilt = TRUE) ``` --- # Checklist (cont.) Open up the [shell](http://happygitwithr.com/shell.html#shell). - Windows users, make sure that you installed a Bash-compatible version of the shell. If you installed [Git for Windows](https://gitforwindows.org), then you should be good to go. ☑ Which version of Git have you installed? ```{bash, cache=FALSE} git --version ``` ☑ Did you introduce yourself to Git? (Substitute in your details.) ```{bash eval=FALSE} git config --global user.name 'Simon Munzert' git config --global user.email 'munzert@hertie-school.org' git config --global --list ``` ☑ Did you register an account in GitHub? --- # Coming up

### The first lab session Wednesday is lab day. Get to know Lisa, Tom, R, and RStudio, four of your best friends for the next months! ### Next lecture Version control and project management