class: title-slide

<br><br><br>

# Lecture 1

## What is Data Science?

### Tyler Ransom

### ECON 5253, University of Oklahoma

---

# What is Data Science?

- .hi[Data science (DS):] The scientific discipline that deals with transforming data into useful information ("insights") using a variety of stats/ML techniques
    - Amazon: Collects data on search history, cart history, and purchases
    - Analyzes the data to estimate users' willingness to pay for various products (including Prime) and to recommend new products
- The rise of data science has come about because of the so-called Big Data revolution
    - The rise of the internet in the late 1990s and 2000s `\(\Rightarrow \,\uparrow\)` opportunities for companies and governments to collect data on consumers & citizens
    - The proliferation of mobile devices & social media from the late 2000s until now has generated even more data

---

# Skills required for data science

.center[
<img src="skills.jpg" width="75%" />

Source: [NC State Univ.](http://sites.nationalacademies.org/cs/groups/cstbsite/documents/webpage/cstb_181680.pdf) (p. 26)
]

---

# Pillars of data science

- Programming (for automating data collection, manipulation, cleaning, visualization, and modeling)
- Visualization & exploration
- Machine learning (to select models, compress data)
- Causal inference (to be able to make a policy prescription)

...assuming one has the appropriate foundation of basic calculus and statistics

---

# The data science workflow

.center[
<img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="85%" />

Source: [R for Data Science](http://r4ds.had.co.nz/introduction.html)
]

---

# Big Data

.center[
<img src="frisch.jpg" width="90%" />
]

---

# Big Data

.center[
<img src="frisch.jpg" width="90%" />
]

Source: Frisch, Ragnar. 1933. "Editor's Note." _Econometrica_ 1(1): 1--4

---

# What is Big Data?

It depends on who you ask. It could mean:

1. "Wild" data (unstructured; collected without a particular intention; e.g. Twitter, in contrast with Census surveys)
2. "Wide" data (a.k.a. "Big-K" data because `\(K>N\)`)
3. "Long" data (a.k.a. "Big-N" data because `\(N\)` is very, very large [and may not all fit onto a single hard drive!])
4. Any data set that cannot be analyzed with classical methods like OLS (e.g. any combination of the above three types)

"Big Data" is not so much about the size of the data as about whether or not "small data" (read: classical) methods can be used

---

# What is machine learning? What is AI?

- .hi[Machine learning (ML):] Allowing computers to learn for themselves without being explicitly programmed
    - USPS: Computers that read the handwriting on envelopes
    - Google: AlphaGo, the computer program that defeated a world-champion Go player
    - Apple/Amazon/Microsoft: The Siri, Alexa, and Cortana voice assistants
- .hi[Artificial intelligence (AI):] Constructing machines (robots, computers) that think and act like human beings
    - ML is a subset of AI

---

# Big data & machine learning

- You'll often hear the phrase "big data and machine learning"
- This is because many machine learning algorithms are helpful for big data problems:
    - Selecting which `\(k<K\)` covariates should enter your model
    - Streamlined techniques for processing "wild" data
    - New modeling approaches that can leverage the greater amount of information that Big Data provides

---

# Correlation vs. causation

- Machine learning is not the end-all, be-all of data science
- A good data scientist knows that correlation is not causation!
- Ultimately, companies want "insights" that they can use to `\(\uparrow\)` profits
    - Tech companies run randomized experiments to estimate causal effects
    - They can also estimate fancier statistical models to account for selection
- Economists' comparative advantage is in combining machine learning with economic theory to produce optimal policies

---

# Example

> One very classic example comes from looking at the data of a shopping cart. Why do sales of beer and diapers go hand in hand? The correlation is women tell their husbands to go pick up diapers, and on the way, they pick up beer, too. That is data science: finding trends from your data and using that insight to increase your sales or market better

Source: http://www.chicagotribune.com/bluesky/originals/ct-bsi-inside-job-4c-insights-20171002-story,amp.html

---

# Lifestyle

- What it's like to have a DS job right now: https://www.simplilearn.com/a-day-in-the-life-of-a-data-scientist-article
- Data scientist job profiles: https://www.mygreatlearning.com/blog/different-data-science-jobs-roles-industry/

---

# Generative AI

- I strongly encourage you to use Generative AI in all aspects of this class
- My favorite is Claude, but ChatGPT and Gemini are equally capable
    - I believe Gemini Pro is free for students
- These tools are powerful enough to do all your work in trivial amounts of time
- .hi[Your challenge will be to force yourself to actually learn the concepts and not use AI as a crutch]

---

# Claude example from PS3 later in the semester

**Me: (question 5 from PS3)**

Write a SQL script that does the following:

(a) Read in the Florida insurance data CSV file

(b) Print out the first 10 rows of the data set

(c) List which counties are in the sample (i.e. list unique values of the county variable)

(d) Compute the average property appreciation from 2011 to 2012 (i.e. compute the mean of tiv_2012 - tiv_2011)

(e) Create a frequency table of the construction variable to see what fraction of buildings are made out of wood or some other material

---

# Claude example (continued)

**Claude:** Let me check if there's a standard Florida insurance dataset I should work with, and create a SQL script that will work with the typical structure of this dataset:

[looks for uploaded files]

[searches web for "Florida insurance data CSV file"]

[finds it on GitHub]

---

# Claude example (continued)

**Claude:** "I can see the structure of the Florida insurance dataset. The columns include: [...]. Now I'll create a SQL script using SQLite (which is commonly available) to accomplish all the requested tasks:"

.scroll-box-12[
```sql
-- ============================================================================
-- Florida Insurance Data Analysis SQL Script
-- ============================================================================
-- This script performs the following analyses on the FL_insurance_sample.csv:
-- (a) Read in the Florida insurance data CSV file
-- (b) Print out the first 10 rows of the data set
-- (c) List unique counties in the sample
-- (d) Compute average property appreciation from 2011 to 2012
-- (e) Create a frequency table of the construction variable
-- ============================================================================

-- Note: This script is written for SQLite. For other database systems,
-- syntax may vary slightly (e.g., for CSV import).
-- ============================================================================
-- (a) Read in the Florida insurance data CSV file
-- ============================================================================

-- First, create the table structure to match the CSV columns
DROP TABLE IF EXISTS florida_insurance;

CREATE TABLE florida_insurance (
    policyID INTEGER,
    statecode TEXT,
    county TEXT,
    eq_site_limit REAL,
    hu_site_limit REAL,
    fl_site_limit REAL,
    fr_site_limit REAL,
    tiv_2011 REAL,
    tiv_2012 REAL,
    eq_site_deductible REAL,
    hu_site_deductible REAL,
    fl_site_deductible REAL,
    fr_site_deductible REAL,
    point_latitude REAL,
    point_longitude REAL,
    line TEXT,
    construction TEXT,
    point_granularity INTEGER
);

-- Import the CSV file (SQLite syntax)
-- Note: Run this from the SQLite command line with:
--   .mode csv
--   .import FL_insurance_sample.csv florida_insurance
-- (You may need to delete the header row after import or use --skip 1)
-- Alternative: For SQLite 3.32+, you can use:
--   .import --csv --skip 1 FL_insurance_sample.csv florida_insurance

-- ============================================================================
-- (b) Print out the first 10 rows of the data set
-- ============================================================================
SELECT '=== PART (b): First 10 Rows of the Dataset ===' AS section;

SELECT policyID, statecode, county, eq_site_limit, hu_site_limit,
       fl_site_limit, fr_site_limit, tiv_2011, tiv_2012,
       eq_site_deductible, hu_site_deductible, fl_site_deductible,
       fr_site_deductible, point_latitude, point_longitude,
       line, construction, point_granularity
FROM florida_insurance
LIMIT 10;

-- ============================================================================
-- (c) List which counties are in the sample (unique values of county)
-- ============================================================================
SELECT '=== PART (c): Unique Counties in the Sample ===' AS section;

SELECT DISTINCT county
FROM florida_insurance
ORDER BY county;

-- Also show the count of unique counties
SELECT COUNT(DISTINCT county) AS total_unique_counties
FROM florida_insurance;

-- ============================================================================
-- (d) Compute the average property appreciation from 2011 to 2012
--     i.e., compute the mean of (tiv_2012 - tiv_2011)
-- ============================================================================
SELECT '=== PART (d): Average Property Appreciation (2011 to 2012) ===' AS section;

SELECT AVG(tiv_2012 - tiv_2011) AS avg_appreciation,
       -- Additional summary statistics for context:
       MIN(tiv_2012 - tiv_2011) AS min_appreciation,
       MAX(tiv_2012 - tiv_2011) AS max_appreciation,
       SUM(tiv_2012 - tiv_2011) AS total_appreciation,
       COUNT(*) AS n_observations
FROM florida_insurance;

-- Also compute mean TIV values for context
SELECT AVG(tiv_2011) AS avg_tiv_2011,
       AVG(tiv_2012) AS avg_tiv_2012,
       AVG(tiv_2012) - AVG(tiv_2011) AS difference_of_means
FROM florida_insurance;

-- ============================================================================
-- (e) Create a frequency table of the construction variable
--     to see what fraction of buildings are made of wood or other material
-- ============================================================================
SELECT '=== PART (e): Frequency Table of Construction Types ===' AS section;

-- Frequency table with counts and percentages
SELECT construction,
       COUNT(*) AS frequency,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM florida_insurance), 2) AS percentage
FROM florida_insurance
GROUP BY construction
ORDER BY frequency DESC;

-- Summary showing Wood vs. Other materials
SELECT '=== Wood vs. Other Materials ===' AS section;

SELECT CASE WHEN construction = 'Wood' THEN 'Wood'
            ELSE 'Other Material'
       END AS material_category,
       COUNT(*) AS frequency,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM florida_insurance), 2) AS percentage
FROM florida_insurance
GROUP BY material_category
ORDER BY frequency DESC;

-- ============================================================================
-- END OF ANALYSIS
-- ============================================================================
```
]

---

# Grading

- Because Generative AI has such vast potential to help you, I will grade problem sets much more strictly than I have in the past
- If your code doesn't run, you won't get credit for it
- You should always make sure your code runs start-to-finish; with AI as a resource, there is no excuse not to turn in perfect code

---

# Final project

- A large portion of your grade will come from a final project
- You should find a research question you're interested in and some data you can use to answer it
- If you are short on ideas, check out https://www.kaggle.com/datasets, which routinely hosts analytics competitions
    - Kaggle publishes data for each competition
- The main criterion for the project is to apply the skills developed in this class
- If you have more questions, please come see me
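
---

# Appendix: sanity-checking AI-generated SQL

One way to avoid the "code that doesn't run" penalty is to cross-check AI-generated output with a second tool. Below is a minimal sketch of how parts (d) and (e) of the PS3 question could be redone in pandas. The DataFrame here is a toy stand-in with made-up values (only the column names match the SQL table above), so the numbers are illustrative, not results from the real data:

```python
import pandas as pd

# Toy stand-in for FL_insurance_sample.csv (hypothetical values;
# only the column names mirror the table defined in the SQL script)
df = pd.DataFrame({
    "tiv_2011": [100.0, 200.0, 300.0, 400.0],
    "tiv_2012": [110.0, 190.0, 330.0, 400.0],
    "construction": ["Wood", "Masonry", "Wood", "Steel Frame"],
})

# (d) Average property appreciation: mean of (tiv_2012 - tiv_2011)
avg_appreciation = (df["tiv_2012"] - df["tiv_2011"]).mean()

# (e) Frequency table of construction, as fractions of the sample
freq = df["construction"].value_counts(normalize=True)

print(avg_appreciation)  # 7.5 on the toy data
print(freq["Wood"])      # 0.5 on the toy data
```

Against the actual file, you would replace the toy DataFrame with `pd.read_csv("FL_insurance_sample.csv")` and check that the pandas numbers match the SQL output.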