class: center, middle, inverse, title-slide .title[ # Topic 5: Data Cleaning ] .author[ ### Nick Hagerty
ECNS 460/560
Montana State University ] --- name: toc <style type="text/css"> # CSS for including pauses in printed PDF output (see bottom of lecture) @media print { .has-continuation { display: block !important; } } .small { font-size: 90%; } .smaller { font-size: 80%; } .xsmall { font-size: 50% } </style> # Table of contents 1. [Another join example](#join) 1. [Keys and relational data](#keys) 1. [Number storage](#numbers) 1. [String cleaning](#strings) 1. [Data Cleaning Checklist](#checklist) --- class: inverse, middle name: join # Another join example --- # Join example Let's try something. Load the **nycflights13** package again. I want to calculate the average air time, distance, and number of seats for flights in this dataset. Number of seats is in the `planes` dataframe, so I have to join `planes` to `flights`. Here is my code and output: ```r flights_planes = inner_join( flights, planes |> rename(year_built = year), by = "tailnum" ) flights_planes |> summarize(across(c(air_time, distance, seats), \(x) mean(x, na.rm=T))) ``` ``` ## # A tibble: 1 × 3 ## air_time distance seats ## <dbl> <dbl> <dbl> ## 1 154. 1069. 137. ``` Can you find the problem here? Try to figure out how it arises and why. --- # Correcting the problem ```r inner_flights_planes = flights |> inner_join(planes |> rename(year_built = year), by="tailnum") inner_flights_planes |> summarize(across(c(air_time, distance, seats), \(x) mean(x, na.rm=T))) ``` ``` ## # A tibble: 1 × 3 ## air_time distance seats ## <dbl> <dbl> <dbl> ## 1 154. 1069. 137. ``` ```r left_flights_planes = flights |> left_join(planes |> rename(year_built = year), by="tailnum") left_flights_planes |> summarize(across(c(air_time, distance, seats), \(x) mean(x, na.rm=T))) ``` ``` ## # A tibble: 1 × 3 ## air_time distance seats ## <dbl> <dbl> <dbl> ## 1 151. 1040. 137. ``` Fixing it is the easy part. Why do the joins produce different answers? --- # Diagnosing the problem How many observations do we have in each dataset? ```r nrow(flights) ``` ``` ## [1] 336776 ``` ```r nrow(inner_flights_planes) ``` ``` ## [1] 284170 ``` ```r nrow(left_flights_planes) ``` ``` ## [1] 336776 ``` * Not all rows of `flights` matched rows in `planes`. * Inner join completely dropped these rows, even though they still had useful data for `air_time` and `distance`. * So the mean was taken over a different set of observations after the inner join than after the left join. --- # Diagnosing the problem Why exactly did these rows not match? Let's take a closer look at the variable we joined by, `tailnum`, in the original `flights` dataset: ```r summary(flights$tailnum) ``` ``` ## Length Class Mode ## 336776 character character ``` That doesn't tell us much, since `tailnum` is a character variable. --- # Diagnosing the problem Why exactly did these rows not match? Let's check more explicitly for missing values: ```r flights |> mutate(na_tailnum = is.na(tailnum)) |> count(na_tailnum) ``` ``` ## # A tibble: 2 × 2 ## na_tailnum n ## <lgl> <int> ## 1 FALSE 334264 ## 2 TRUE 2512 ``` Hmm. Some rows have missing values, but 334k are non-missing, which is more than the 284k rows remaining in the inner-joined dataframe. * So about 50,000 of these flights must have values of `tailnum` that just aren't found in `planes`. --- # Diagnosing the problem To see what happened even more carefully, we can add a `keep=T` option when we join. ```r flights_planes = flights |> left_join(planes |> rename(year_built = year), by="tailnum", keep=TRUE ) # How many rows of flights matched a row in planes? 
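# (keep = TRUE keeps the join key from both tables, as tailnum.x and tailnum.y)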
flights_planes |> filter(tailnum.x == tailnum.y) |> nrow() ``` ``` ## [1] 284170 ``` ```r # How many rows failed to match because tailnum was missing in flights? flights_planes |> filter(is.na(tailnum.x)) |> nrow() ``` ``` ## [1] 2512 ``` ```r # How many rows failed to match because tailnum was not found in planes? flights_planes |> filter(!is.na(tailnum.x) & is.na(tailnum.y)) |> nrow() ``` ``` ## [1] 50094 ``` --- # Diagnosing the problem Why is the mean of `seats` not affected? -- .pull-left[ ```r inner_flights_planes |> mutate(na_seats = is.na(seats)) |> count(na_seats) ``` ``` ## # A tibble: 1 × 2 ## na_seats n ## <lgl> <int> ## 1 FALSE 284170 ``` ] .pull-right[ ```r left_flights_planes |> mutate(na_seats = is.na(seats)) |> count(na_seats) ``` ``` ## # A tibble: 2 × 2 ## na_seats n ## <lgl> <int> ## 1 FALSE 284170 ## 2 TRUE 52606 ``` ] * `seats` does have missing values in the left-joined dataset, from rows of `flights` that did not match `planes`. * But the non-missing values are the same ones, and when we took means, we specifically removed the missing values using `na.rm=T`. --- # A note on Stata (If you don't use Stata and have no plans to, you can ignore this slide!) Already we can see some of the biggest advantages of R over Stata: * Holds multiple dataframes in memory at the same time -- allowing us to easily examine, compare, and switch between them. * In Stata, you have to constantly use tempfiles along with `save` and `use` * Or sometimes you can get by with `preserve`/`restore` * Holds multiple versions/stages of your data in memory at the same time. * In Stata, you cannot go back and look at an intermediate stage of your data without re-running everything. * You can easily see all of your objects in the Environment tab. * Stata can hold non-dataframe variables (locals and globals, collectively "macros") * But it does not make them easy to find --- # A note on Stata (If you don't use Stata and have no plans to, you can ignore this slide!) Already we can see some of the biggest advantages of R over Stata: * Missing values are treated in a safe and sensible way. * NA's propagate through calculations unless you specifically choose to set `na.rm=TRUE`. * In Stata, missing values equal `\(+\infty\)` (with `. < .a < .b < .c < ...`). * In Stata, `drop if x > 1000` will **drop all observations with missing values of x!** * (To fix this, ALWAYS add `drop if x>1000 & !missing(x)` to all your `if` conditions.) --- class: inverse, middle name: keys # Keys & Relational Data Images in this section are from ["R for Data Science"](https://r4ds.had.co.nz/relational-data.html) by Wickham & Grolemund, used under [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) and not included under this resource's overall CC license. --- # Relational data **Relational data**: multiple tables of data that have relationships to each other (i.e., tables that you would ever consider joining). * Relational data is most often discussed in the context of *databases* like SQL. * But we also use it when doing any form of data wrangling that involves joins, so it's helpful to think more formally about it. --- # Relationships in nycflights <img src="img/relational-nycflights.png" width="60%" style="display: block; margin: auto;" /> - `flights` connects to `planes` via a single variable, `tailnum`. - `flights` connects to `airlines` through the `carrier` variable. - `flights` connects to `airports` in two ways: via the `origin` and `dest` variables. 
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time).

What is the relationship between `weather` and `airports`?

---

# Keys

A **key** is a variable (or set of variables) that uniquely identifies an observation.

* In `planes`, the key is `tailnum`.
* In `weather`, the key consists of 5 variables: (`year`, `month`, `day`, `hour`, `origin`).

There are two types of keys:

* A **primary key** uniquely identifies an observation in its own data frame.
  * `planes$tailnum` is a primary key because it uniquely identifies each plane in the `planes` data frame.
* A **foreign key** uniquely identifies an observation in another data frame.
  * `flights$tailnum` is a foreign key because it appears in the `flights` data frame, where it matches each flight to a unique plane.

A variable can be both a primary key *and* a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` data frame.

---

# Keys

The primary key is the **first thing** you need to know about a new data frame. Once you think you know the primary key, **verify it.** Here's one way to do that:

```r
planes |> count(tailnum) |> filter(n > 1)
```

```
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
## # ℹ Use `colnames()` to see all variable names
```

--

Here's another:

```r
nrow(planes)
```

```
## [1] 3322
```

```r
nrow(planes |> distinct(tailnum))
```

```
## [1] 3322
```

---

# Keys

You can write a **unit test** into your code to make sure this is true before proceeding:

```r
dups_planes = planes |> count(tailnum) |> filter(n > 1)
stopifnot(nrow(dups_planes) == 0)

dups_weather = weather |> count(year, month, day, hour, origin) |> filter(n > 1)
stopifnot(nrow(dups_weather) == 0)
```

```
## Error: nrow(dups_weather) == 0 is not TRUE
```

--

Alternatively:

```r
stopifnot(nrow(planes) == nrow(planes |> distinct(tailnum)))
stopifnot(nrow(weather) == nrow(weather |> distinct(year, month, day, hour, origin)))
```

```
## Error: nrow(weather) == nrow(distinct(weather, year, month, day, hour, .... is not TRUE
```

Notice that the `weather` test fails: the 5-variable key we expected is not actually unique. Catching surprises like this is exactly why you verify the key before relying on it.

---

# Surrogate keys

What's the primary key in the `flights` data frame?

Take a minute to investigate/verify.
--

You might think it would be date + carrier + flight number (or tail number), but neither combination is unique:

```r
flights |> count(year, month, day, carrier, flight) |> filter(n > 1)
```

```
## # A tibble: 24 × 6
##     year month   day carrier flight     n
##    <int> <int> <int> <chr>    <int> <int>
##  1  2013     6     8 WN        2269     2
##  2  2013     6    15 WN        2269     2
##  3  2013     6    22 WN        2269     2
##  4  2013     6    29 WN        2269     2
##  5  2013     7     6 WN        2269     2
##  6  2013     7    13 WN        2269     2
##  7  2013     7    20 WN        2269     2
##  8  2013     7    27 WN        2269     2
##  9  2013     8     3 WN        2269     2
## 10  2013     8    10 WN        2269     2
## # ℹ 14 more rows
```

---

# Surrogate keys

If a data frame lacks a primary key, but it is tidy (each row is an observation), it's often useful to add a **surrogate key**:

```r
flights2 = flights |>
  arrange(year, month, day, carrier, flight, sched_dep_time) |>
  mutate(id = row_number()) |>
  relocate(id)
flights2
```

```
## # A tibble: 336,776 × 20
##       id  year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1     1  2013     1     1     1825           1829        -4     2056
##  2     2  2013     1     1     1452           1455        -3     1637
##  3     3  2013     1     1     2015           2005        10     2149
##  4     4  2013     1     1     1637           1545        52     1858
##  5     5  2013     1     1     1939           1840        59       29
##  6     6  2013     1     1     1554           1600        -6     1701
##  7     7  2013     1     1     1546           1540         6     1753
##  8     8  2013     1     1     2115           1700       255     2330
##  9     9  2013     1     1     2023           1945        38     2240
## 10    10  2013     1     1     2046           2035        11     2144
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

# Relations

A primary key and the corresponding foreign key in another data frame form a **relation**.

In general, relations are **1-to-many**: Each flight has one plane, but each plane has many flights.

* Sometimes you'll see a **1-to-1** relation, but you can think of this as a special case of 1-to-many.
* You can also find **many-to-many** relations, but you can think of these as two 1-to-many relations going in each direction.
  * There's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.

<br/>

**Note on Stata: NEVER USE `merge m:m`. JUST DON'T DO IT.** There is no scenario in which it will give you what you want. This syntax should not exist. If you are tempted, you are probably either confused or looking for `joinby`.

---

# Relations

The join functions do **not** check whether your key is unique, or what type of relation you have. Instead, they simply return all possible combinations of matching observations in your two dataframes:

<img src="img/join-one-to-many.png" width="50%" style="display: block; margin: auto;" />

<img src="img/join-many-to-many.png" width="60%" style="display: block; margin: auto;" />

---

# Duplicate keys

What if you join by a key that is not actually unique, when you think it is? You'll get extra rows:

```r
flights_weather = flights |>
  left_join(weather, by = c("year", "month", "day", "origin"))
```

```
## Warning in left_join(flights, weather, by = c("year", "month", "day", "origin")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 8704 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
```

```r
nrow(flights_weather)
```

```
## [1] 8036575
```

Now you no longer have a dataframe of flights.
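One guardrail, sketched here assuming dplyr 1.1 or later (which added the `relationship` argument mentioned in the warning above), is to declare the relation you expect, so the join fails loudly instead of silently multiplying rows:

```r
# Declare the expected relation up front; if the keys violate it,
# dplyr stops with an error instead of quietly duplicating rows
flights_weather = flights |>
  left_join(weather,
            by = c("year", "month", "day", "origin"),
            relationship = "many-to-one")
```

This call would error immediately, because each flight matches many hourly `weather` rows. For reference, the original row count: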
```r
nrow(flights)
```

```
## [1] 336776
```

---

# A better way to join

Here's an example of a good (safe) way to join `flights` and `planes`:

```r
# Confirm that tailnum is the primary key (unique ID) of planes
stopifnot(nrow(planes) == nrow(planes |> distinct(tailnum)))

# Join, keeping the join keys from both datasets
flights_planes = flights |>
  left_join(planes |> rename(year_built = year), by = "tailnum", keep = TRUE) |>
  rename(tailnum = tailnum.x, tailnum_planes = tailnum.y)

# Confirm the join was 1:many
stopifnot(nrow(flights) == nrow(flights_planes))
```

---

class: inverse, middle
name: numbers

# Number storage

---

# Variable types

What types of variables might you encounter in a raw dataset?

* Quantitative variables (`numeric` or `integer`).
* Dates and times (a special, reformatted type of integer).
* Categorical variables (`factor`).
  - Values are discrete and have a specific interpretation.
* Binary variables (`logical` or `integer`).
  - TRUE/FALSE or 0/1.
  - Can be thought of as a special type of categorical variable.
* Strings (`character`).
  - A sequence of letters, numbers, symbols, or other characters.

---

# Floating point problems

Simplify this expression: `\(1-\frac{1}{49}*49\)`

It's obviously 0. Now ask R:

```r
1 - (1/49)*49
```

--

```
## [1] 1.110223e-16
```

This is called a **floating point** problem. It arises from the way computers store numbers.

R doesn't notice that `\(49/49\)` simplifies to 1. It just follows the order of operations. So the first thing it does is calculate:

```r
(1/49)
```

```
## [1] 0.02040816
```

This number is rational, but its decimal (and binary) expansion never terminates. So R rounds it to 53 significant binary digits (about 16 decimal digits) before multiplying by 49.

---

# Floating point problems

Most of the time, 53 bits (about 16 decimal digits) is plenty of precision. But sometimes it creates problems.

Note: This explanation is still simplified. The floating-point issue goes deeper than non-terminating fractions: even terminating decimals like 0.9 and 0.1 have no exact binary representation. Here's another example:

```r
1 - 0.9 - 0.1
```

```
## [1] -2.775558e-17
```

--

In 1996, a floating-point error caused a European Space Agency rocket to self-destruct 37 seconds after it was launched.

---

# Avoiding floating-point errors

Pay attention to the data type of your variables.

Avoid using logical conditions like `height == 180` for numeric variables.

* `height` may even read as `180` in the `View` window.
* But under the hood, it might still be stored as `180.000000000000173...`.

What you can do instead:

* **Best option:** `dplyr::near` compares numbers with a built-in tolerance.
* Use `>` and `<` comparisons, or `between(height, 179.9, 180.1)`.
* Convert in place: `as.integer(height) == 180` (beware: `as.integer` truncates rather than rounds).
* Or with finer control: `round(height, digits=6) == 180`.
* If all values are integers, store the variable as an integer in the first place.

---

# How to store a number?

**Numeric** variables are stored in binary scientific notation (floating point).

* Use them to represent quantities, whose digits decrease in importance from left to right.
* Example: My height is `182.2469405113283` cm.

**Integer** variables lack decimal places.

* They save memory relative to numeric variables.
* They store values exactly, avoiding some floating-point problems.

**Character** (string) variables store the full sequence of digits literally.

* Use them when digits lack quantitative information, and each digit is equally important.
* Phone numbers, credit card numbers, etc.
* No chance of the right-most digits getting lost or corrupted.

---

# More variable formats

**Dates and times** allow you to easily do math and logic on dates and times.

* See tidyverse package `lubridate`.
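A minimal sketch of the kinds of things `lubridate` makes easy (standard `lubridate` functions, shown here for illustration):

```r
library(lubridate)
d = ymd("2013-01-01")    # parse a date from a string
d + days(30)             # date arithmetic: "2013-01-31"
wday(d, label = TRUE)    # day of the week: Tue
```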
**Factors** allow you to store values as numbers, but *display* them as strings. * This is useful for sorting things like month names: "Jan", "Feb", "Mar", "Apr".... * See tidyverse packages `forcats`. --- # Memory space Memory space quickly becomes a problem when you work with large datasets. * But R does a reasonably good job of handling storage efficiently. Logical variables are smaller than integers, which are smaller than numeric. Does it save memory to store a variable as a factor instead of a string? * This used to be true: factor variables only store the factor labels once. * But no longer: R uses a global string pool -- each unique string is only stored once. `pryr::object_size()` will tell you how much memory an object takes up (accounting for shared elements within an object). --- class: inverse, middle name: strings # String cleaning Parts of this section are adapted from [“Introduction to Data Science”](http://rafalab.dfci.harvard.edu/dsbook/string-processing.html) by Rafael A. Irizarry, used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0). --- # String cleaning example Let's clean this raw data, the result of a web form in which students were asked for their height in inches: ```r library(dslabs) data(reported_heights) str(reported_heights) ``` ``` ## 'data.frame': 1095 obs. of 3 variables: ## $ time_stamp: chr "2014-09-02 13:40:36" "2014-09-02 13:46:59" "2014-09-02 13:59:20" "2014-09-02 14:51:53" ... ## $ sex : chr "Male" "Male" "Male" "Male" ... ## $ height : chr "75" "70" "68" "74" ... ``` Unfortunately `height` is not numeric. What if we just coerce it to numeric? --- # String cleaning example Unfortunately `height` is not numeric. What if we just coerce it to numeric? ```r heights2 = reported_heights |> mutate(height_num = as.numeric(height)) ``` ``` ## Warning: There was 1 warning in `mutate()`. ## ℹ In argument: `height_num = as.numeric(height)`. ## Caused by warning: ## ! NAs introduced by coercion ``` ```r sum(is.na(heights2$height_num)) ``` ``` ## [1] 81 ``` R let us do this... but we lost a lot of information. --- # String cleaning example Let's see some examples of non-numeric entries: ```r heights_probs = filter(heights2, is.na(height_num)) View(heights_probs) x = heights_probs$height x ``` ``` ## [1] "5' 4\"" "165cm" "5'7" ## [4] ">9000" "5'7\"" "5'3\"" ## [7] "5 feet and 8.11 inches" "5'11" "5'9''" ## [10] "5'10''" "5,3" "6'" ## [13] "6,8" "5' 10" "Five foot eight inches" ## [16] "5'5\"" "5'2\"" "5,4" ## [19] "5'3" "5'10''" "5'3''" ## [22] "5'7''" "5'12" "2'33" ## [25] "5'11" "5'3\"" "5,8" ## [28] "5'6''" "5'4" "1,70" ## [31] "5'7.5''" "5'7.5''" "5'2\"" ## [34] "5' 7.78\"" "yyy" "5'5" ## [37] "5'8" "5'6" "5 feet 7inches" ## [40] "6*12" "5 .11" "5 11" ## [43] "5'4" "5'8\"" "5'5" ## [46] "5'7" "5'6" "5'11\"" ## [49] "5'7\"" "5'7" "5'8" ## [52] "5' 11\"" "6'1\"" "69\"" ## [55] "5' 7\"" "5'10''" "5'10" ## [58] "5'10" "5ft 9 inches" "5 ft 9 inches" ## [61] "5'2" "5'11" "5'11''" ## [64] "5'8\"" "708,661" "5 feet 6 inches" ## [67] "5'10''" "5'8" "6'3\"" ## [70] "649,606" "728,346" "6 04" ## [73] "5'9" "5'5''" "5'7\"" ## [76] "6'4\"" "5'4" "170 cm" ## [79] "7,283,465" "5'6" "5'6" ``` --- # String cleaning example Many of these entries have valuable information, so let's try to salvage as much as we can. **The general way to proceed is:** 1. Identify the most common patterns among the problematic entries. 2. Write an algorithm to correct these. 3. Review results to make sure your algorithm worked correctly. 4. 
Look at the remaining problematic entries. Tweak your algorithm or add another one. 5. Stop when all useful information is corrected (or when MB < MC). **What are the most common patterns?** -- * Strings of the form `x'y` or `x'y"` where `x` is feet and `y` is inches. * Strings of the form `x ft y inches`, except that "ft" and "inches" are inconsistent. **My approach is going to be to:** 1. Try to convert everything to the pattern `x y`. 2. `separate` the feet and inches values. 3. Calculate total inches from feet and inches. --- # String cleaning example Start by replacing 4 punctuation marks with spaces (note we have to **escape** the "): ```r x2 = x |> str_replace_all("'", " ") |> str_replace_all(",", " ") |> str_replace_all("\"", " ") |> str_replace_all(",", " ") x2 ``` ``` ## [1] "5 4 " "165cm" "5 7" ## [4] ">9000" "5 7 " "5 3 " ## [7] "5 feet and 8.11 inches" "5 11" "5 9 " ## [10] "5 10 " "5 3" "6 " ## [13] "6 8" "5 10" "Five foot eight inches" ## [16] "5 5 " "5 2 " "5 4" ## [19] "5 3" "5 10 " "5 3 " ## [22] "5 7 " "5 12" "2 33" ## [25] "5 11" "5 3 " "5 8" ## [28] "5 6 " "5 4" "1 70" ## [31] "5 7.5 " "5 7.5 " "5 2 " ## [34] "5 7.78 " "yyy" "5 5" ## [37] "5 8" "5 6" "5 feet 7inches" ## [40] "6*12" "5 .11" "5 11" ## [43] "5 4" "5 8 " "5 5" ## [46] "5 7" "5 6" "5 11 " ## [49] "5 7 " "5 7" "5 8" ## [52] "5 11 " "6 1 " "69 " ## [55] "5 7 " "5 10 " "5 10" ## [58] "5 10" "5ft 9 inches" "5 ft 9 inches" ## [61] "5 2" "5 11" "5 11 " ## [64] "5 8 " "708 661" "5 feet 6 inches" ## [67] "5 10 " "5 8" "6 3 " ## [70] "649 606" "728 346" "6 04" ## [73] "5 9" "5 5 " "5 7 " ## [76] "6 4 " "5 4" "170 cm" ## [79] "7 283 465" "5 6" "5 6" ``` --- # String cleaning example We can make this more concise by using **regular expressions** (more on this soon): ```r x2 = x |> str_replace_all("'|,|\"|,", " ") x2 ``` ``` ## [1] "5 4 " "165cm" "5 7" ## [4] ">9000" "5 7 " "5 3 " ## [7] "5 feet and 8.11 inches" "5 11" "5 9 " ## [10] "5 10 " "5 3" "6 " ## [13] "6 8" "5 10" "Five foot eight inches" ## [16] "5 5 " "5 2 " "5 4" ## [19] "5 3" "5 10 " "5 3 " ## [22] "5 7 " "5 12" "2 33" ## [25] "5 11" "5 3 " "5 8" ## [28] "5 6 " "5 4" "1 70" ## [31] "5 7.5 " "5 7.5 " "5 2 " ## [34] "5 7.78 " "yyy" "5 5" ## [37] "5 8" "5 6" "5 feet 7inches" ## [40] "6*12" "5 .11" "5 11" ## [43] "5 4" "5 8 " "5 5" ## [46] "5 7" "5 6" "5 11 " ## [49] "5 7 " "5 7" "5 8" ## [52] "5 11 " "6 1 " "69 " ## [55] "5 7 " "5 10 " "5 10" ## [58] "5 10" "5ft 9 inches" "5 ft 9 inches" ## [61] "5 2" "5 11" "5 11 " ## [64] "5 8 " "708 661" "5 feet 6 inches" ## [67] "5 10 " "5 8" "6 3 " ## [70] "649 606" "728 346" "6 04" ## [73] "5 9" "5 5 " "5 7 " ## [76] "6 4 " "5 4" "170 cm" ## [79] "7 283 465" "5 6" "5 6" ``` --- # String cleaning example Also get rid of some common words, and **trim** extra spaces: ```r x2 = x |> str_replace_all("'|,|\"|,|ft|feet|inches|and", " ") |> str_trim() x2 ``` ``` ## [1] "5 4" "165cm" "5 7" ">9000" ## [5] "5 7" "5 3" "5 8.11" "5 11" ## [9] "5 9" "5 10" "5 3" "6" ## [13] "6 8" "5 10" "Five foot eight" "5 5" ## [17] "5 2" "5 4" "5 3" "5 10" ## [21] "5 3" "5 7" "5 12" "2 33" ## [25] "5 11" "5 3" "5 8" "5 6" ## [29] "5 4" "1 70" "5 7.5" "5 7.5" ## [33] "5 2" "5 7.78" "yyy" "5 5" ## [37] "5 8" "5 6" "5 7" "6*12" ## [41] "5 .11" "5 11" "5 4" "5 8" ## [45] "5 5" "5 7" "5 6" "5 11" ## [49] "5 7" "5 7" "5 8" "5 11" ## [53] "6 1" "69" "5 7" "5 10" ## [57] "5 10" "5 10" "5 9" "5 9" ## [61] "5 2" "5 11" "5 11" "5 8" ## [65] "708 661" "5 6" "5 10" "5 8" ## [69] "6 3" "649 606" "728 346" "6 04" ## [73] "5 9" "5 5" "5 7" "6 4" ## [77] "5 4" 
"170 cm" "7 283 465" "5 6" ## [81] "5 6" ``` --- # String cleaning example Also remove extra spaces **within** a string: ```r x2 = x |> str_replace_all("'|,|\"|,|ft|feet|inches|and", " ") |> str_squish() x2 ``` ``` ## [1] "5 4" "165cm" "5 7" ">9000" ## [5] "5 7" "5 3" "5 8.11" "5 11" ## [9] "5 9" "5 10" "5 3" "6" ## [13] "6 8" "5 10" "Five foot eight" "5 5" ## [17] "5 2" "5 4" "5 3" "5 10" ## [21] "5 3" "5 7" "5 12" "2 33" ## [25] "5 11" "5 3" "5 8" "5 6" ## [29] "5 4" "1 70" "5 7.5" "5 7.5" ## [33] "5 2" "5 7.78" "yyy" "5 5" ## [37] "5 8" "5 6" "5 7" "6*12" ## [41] "5 .11" "5 11" "5 4" "5 8" ## [45] "5 5" "5 7" "5 6" "5 11" ## [49] "5 7" "5 7" "5 8" "5 11" ## [53] "6 1" "69" "5 7" "5 10" ## [57] "5 10" "5 10" "5 9" "5 9" ## [61] "5 2" "5 11" "5 11" "5 8" ## [65] "708 661" "5 6" "5 10" "5 8" ## [69] "6 3" "649 606" "728 346" "6 04" ## [73] "5 9" "5 5" "5 7" "6 4" ## [77] "5 4" "170 cm" "7 283 465" "5 6" ## [81] "5 6" ``` --- # String cleaning example A few more tweaks: ```r x2 = x |> str_replace_all("'|,|\"|,|\\*|ft|feet|inches|and", " ") |> str_squish() |> str_replace(" \\.", " ") x2 ``` ``` ## [1] "5 4" "165cm" "5 7" ">9000" ## [5] "5 7" "5 3" "5 8.11" "5 11" ## [9] "5 9" "5 10" "5 3" "6" ## [13] "6 8" "5 10" "Five foot eight" "5 5" ## [17] "5 2" "5 4" "5 3" "5 10" ## [21] "5 3" "5 7" "5 12" "2 33" ## [25] "5 11" "5 3" "5 8" "5 6" ## [29] "5 4" "1 70" "5 7.5" "5 7.5" ## [33] "5 2" "5 7.78" "yyy" "5 5" ## [37] "5 8" "5 6" "5 7" "6 12" ## [41] "5 11" "5 11" "5 4" "5 8" ## [45] "5 5" "5 7" "5 6" "5 11" ## [49] "5 7" "5 7" "5 8" "5 11" ## [53] "6 1" "69" "5 7" "5 10" ## [57] "5 10" "5 10" "5 9" "5 9" ## [61] "5 2" "5 11" "5 11" "5 8" ## [65] "708 661" "5 6" "5 10" "5 8" ## [69] "6 3" "649 606" "728 346" "6 04" ## [73] "5 9" "5 5" "5 7" "6 4" ## [77] "5 4" "170 cm" "7 283 465" "5 6" ## [81] "5 6" ``` --- # String cleaning example This looks pretty good, so let's apply this code to the original data: ```r heights3 = reported_heights |> # Preserve original height column rename(height_orig = height) |> mutate(height_clean = height_orig) |> # Clean height values mutate( height_clean = str_replace_all(height_clean, "'|,|\"|,|\\*|ft|feet|inches|and", " "), height_clean = str_squish(height_clean), height_clean = str_replace_all(height_clean, " \\.", " ") ) # Calculate total inches heights4 = heights3 |> # Separate feet and inches into 2 columns separate(height_clean, sep = " ", into = c("feet", "inches"), fill = "left") |> # Coerce them to numeric mutate(across(c(feet, inches), as.numeric)) |> # Replace NAs in feet with 0's mutate(feet = replace_na(feet, 0)) |> # Calculate new height mutate(height_clean = feet * 12 + inches) ``` ``` ## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [311, 1082]. ``` ``` ## Warning: There were 2 warnings in `mutate()`. ## The first warning was: ## ℹ In argument: `across(c(feet, inches), as.numeric)`. ## Caused by warning: ## ! NAs introduced by coercion ## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning. ``` --- # String cleaning example Great. Now let's go back and look at the remaining non-numeric values: ```r filter(heights4, is.na(height_clean)) ``` ``` ## time_stamp sex height_orig feet inches height_clean ## 1 2014-09-02 15:16:37 Female 165cm 0 NA NA ## 2 2014-09-02 15:16:56 Male >9000 0 NA NA ## 3 2014-10-08 19:19:33 Female Five foot eight inches 0 NA NA ## 4 2015-05-12 19:23:51 Female yyy 0 NA NA ## 5 2017-06-19 04:20:32 Male 170 cm 170 NA NA ``` Only 5 left. 
We could do a bit more with them, but let's set them aside for now.

Next, look at the numerical range of height values. Do they make sense?

```r
summary(heights4$height_clean)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    65.0    69.0   117.2    72.0 11111.0       5
```

---

# String cleaning example

If you `View` the data (I recommend sorting with `arrange(height_clean)`), you'll find:

* Many values between 5 and 7 which are clearly in **feet** instead of inches.
* Many values between 150 and 214 which are clearly in **cm** instead of inches.

```r
heights5 = heights4 |>
  mutate(height_clean = case_when(
    # Convert values in feet
    height_clean >= 5 & height_clean <= 7 ~ height_clean * 12,
    # Convert values in cm
    between(height_clean, 150, 214) ~ height_clean / 2.54,
    # Otherwise keep same value
    TRUE ~ height_clean
  )) |>
  arrange(height_clean)
```

---

# String cleaning example

Now, how many values fall outside the plausible range?

```r
heights5 |> mutate(ok = between(height_clean, 3.5*12, 7.5*12)) |> count(ok)
```

```
##      ok    n
## 1 FALSE   30
## 2  TRUE 1060
## 3    NA    5
```

1. Some of these remaining values may still contain interpretable information. **There may be more cleaning to do.**
1. Some of them may not, in which case we probably won't use them for analysis. **But don't discard them yet!** We'll come back to extreme values (aka outliers) in a couple of weeks.
1. You'll find there are also a few instances where our cleaned value appears sensible, but the original value does not. **You may need to tweak the algorithm further.**

---

# Regular expressions

Regular expressions are code that describes patterns within strings. They are not specific to R but work across basically all programming languages.

```r
names = c("Python", "SPSS", "Stata", "Julia")

# Match strings that CONTAIN a lowercase "t"
str_view_all(names, "t")
```

```
## Warning: `str_view_all()` was deprecated in stringr 1.5.0.
## ℹ Please use `str_view()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
```
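`str_view_all` highlights matches interactively; for a check you can compute with, `str_detect` (listed with the other useful functions at the end of this section) returns a logical vector of matches:

```r
str_detect(names, "t")   # TRUE FALSE TRUE FALSE: "Python" and "Stata" contain "t"
```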
--- # Common regular expressions .pull-left[ Match strings that START with a capital "S": ```r str_view_all(names, "^S") ```
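(Here the matches are `SPSS` and `Stata`.)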
] .pull-right[ Match strings that END with a lowercase "a": ```r str_view_all(names, "a$") ```
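(Here the matches are `Stata` and `Julia`.)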
] `^` and `$` are called **anchors**. --- # Common regular expressions .pull-left[ Match all lowercase vowels: ```r str_view_all(names, "[aeiou]") ```
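(Here every vowel in `Python`, `Stata`, and `Julia` is matched; `SPSS` contains no lowercase vowels.)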
] .pull-right[ Match everything BUT lowercase vowels: ```r str_view_all(names, "[^aeiou]") ```
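(Here every character that is not a lowercase vowel is matched, in all four names.)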
] --- # Common regular expressions .pull-left[ Use a vertical bar for "or": ```r str_view_all(names, "Stata|SPSS") ```
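(Here the matches are `SPSS` and `Stata`.)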
] .pull-right[ And parentheses to clarify: ```r str_view_all(names, "S(tata|PSS)") ```
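(Here the matches are the same: `SPSS` and `Stata`.)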
] --- # Last remarks on regular expressions **All kinds of regex cheat sheets and interactive testers are available via a quick Google.** Regexps are hard to read and troubleshoot. Try not to get too deep into them -- you can often accomplish the same goal by breaking it up into smaller chunks. > Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. - Jamie Zawinski --- # Last remarks on regular expressions This is (the start of) a real regular expression that checks whether an email address is valid: `(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)` --- # Useful functions for cleaning strings `stringr` functions we've used here: * `str_replace` and `str_replace_all`: Replace parts of strings. 
* `str_trim` and `str_squish`: Remove extra spaces. * `str_view_all`: Illustrates matches, to help develop regular expressions. Other tidyverse functions we've used: * `between`: Test whether values fall within a numerical range. * `case_when`: Multiple conditional expressions. * `replace_na`: Set missing values to a certain value. Other useful `stringr` functions: * `str_sub`: Subset strings by position of characters. * `str_detect`: Test whether a string matches a pattern. Other useful tidyverse functions: * `na_if`: Set a certain value to missing. * `bind_rows`: Append two datasets that have the same variable structure. --- class: inverse, middle name: checklist # Data Cleaning Checklist --- # Data Cleaning Checklist Neither exhaustive nor necessary for all purposes, but a good starting point. *** 1. **Convert file formats** as necessary, and import your data. 1. **Structure data into tidy format** if not already. 1. **Remove irrelevant, garbage, or empty** columns and rows. 1. **Identify the primary key**, or define a surrogate key. 1. **Resolve duplicates** (remove true duplicates, or redefine the primary key). 1. **Understand the definition, origin, and units** of each variable, and document as necessary. 1. **Rename variables** as necessary, to be succinct and descriptive. --- # Data Cleaning Checklist Neither exhaustive nor necessary for all purposes, but a good starting point. *** <ol start=8> <li><b>Convert variable formats</b> as necessary: <ul> <li> Numeric variables -- may be inappropriately stored as strings when there are typos. <li> Dates and times -- store in date or time format. <li> Binary variables -- code as 0/1 integers (not "Yes"/"No" or 1/2). <li> Factors -- use when strings take a limited set of possible values. <li> ID variables -- store as integers or character, not numeric. <li> Strings of digits -- store as character, not numeric. </ul> <br> <li><b>Understand patterns of missing values.</b> <ul> <li> Find out why they're missing. <li> Make sure they are not more widespread than you expect. <li> Convert other intended designations (i.e., -1 or -999) to NA. <li> Distinguish between missing values and true zeros. </ul> <br> </ol> --- # Data Cleaning Checklist <ol start=10> <li><b>Make units and scales consistent.</b> Avoid having in the same variable: <ul> <li> Some values in meters and others in feet. <li> Some values in USD and others in GBP. <li> Some percentages as 40% and others as 0.4. <li> Some values as millions and others as billions. </ul> <br> <li><b>Enforce logical conditions on quantitative variables.</b> <ul> <li> Define any range restrictions each variable should satisfy, and check them. <li> Correct any violations that are indisputable data entry mistakes. <li> Create a flag variable to mark remaining violations. </ul> <br> <li><b>Clean string variables</b> as needed. Some common operations: <ul> <li> Make entirely uppercase or lowercase <li> Remove punctuation <li> Trim spaces (extra, starting, ending) <li> Ensure order of names is consistent <li> Remove uninformative words like "the" and "a" <li> Correct spelling inconsistencies (consider text clustering packages) </ul> <br> </ol> --- # Data Cleaning Checklist Neither exhaustive nor necessary for all purposes, but a good starting point. *** <ol start=13> <li><b>Save your clean data</b> to disk before further manipulation (merging data, transforming variables, restricting the sample). 
Think of the whole wrangling/cleaning/analysis pipeline as 2 distinct phases: <ul> <li> Taking messy data from external sources and making a nice, neat table that you are likely to use for multiple purposes in analysis. <li> Taking that nice, neat table and doing all kinds of new things with it. </ul> <br> </ol> --- # Best practices for data cleaning 1. **Record all steps in a script.** -- 1. **Never overwrite the original raw data file.** -- 1. **When editing values, identify observations by substantive logical conditions** rather than by observation ID or (even worse) row number. You want the changes you make to be rule-based, for 2 reasons: * So that they're general -- able to handle upstream changes to the data. * So that they're principled -- no one can accuse you of cherry-picking. -- 1. **Look at your data** every step of the way, to spot issues you haven't thought of, and to make sure you're actually doing what you think you're doing. --- # Summary 1. [Another join example](#join) 1. [Keys and relational data](#keys) 1. [Number storage](#numbers) 1. [String cleaning](#strings) 1. [Data Cleaning Checklist](#checklist)