class: center, middle, inverse, title-slide

# Lecture 12

## Getting to Know Your Data: Cleaning and Exploration

### Tyler Ransom

### ECON 5253, University of Oklahoma

---

# Plan for the day

- Overview of the data preparation process

- Tips for getting to know your data

---

# Data Cleaning

- .hi[Data Cleaning:] the process by which you "tame" your "wild" data

- This is an important step of the process because it is at this step where *measurement* is determined

    - (e.g. "how many cigarettes per day do I define to be 'smoking'?")

- Measurement is important but often overlooked

- Your conclusions are limited by the quality of your data!

---

# Data Cleaning Steps (1/2)

You may need to do the following steps during the data cleaning process:

* .hi[Convert] file format
    * e.g. JSON to CSV
* .hi[Rectangularize] your data
    * e.g. store in a data frame (or other useful container)
* .hi[Clean up string] variables
    * e.g. remove "stop words" like "the," remove punctuation, keep only word stems
* Convert .hi[format] of variables for downstream use
    * e.g. convert dates to a specific format
* .hi[Normalize] values of numerical data
    * e.g. standard deviation units

---

# Data Cleaning Steps (2/2)

* .hi[Impute] missing values
* .hi[Merge] with other existing data sources
* .hi[Remove] outlier observations
    * e.g. by top- or bottom-coding, aka *winsorizing*
* .hi[Fix typos] in numeric variables
    * e.g. `5.25` instead of `525`
* .hi[Browse] the data

---

# Data Exploration

- .hi[Data Exploration:] the process by which you more fully inspect your data

- Data cleaning and data exploration constitute an iterative process

- This is yet another reason to make sure you write code that is reproducible!

- You are only ever "done" with code after running it 100+ times

---

# Data Exploration Steps (1/2)

Data exploration may include the following steps:

* Computing .hi[descriptive statistics]
    * e.g. mean, variance, min, max, median, etc.
* Creating .hi[frequency tables]
    * e.g. to tell if a variable is continuous or discrete
* Creating .hi[transformations] of existing variables
    * e.g. `x^2`, `ln(x)`, etc.
* Creating .hi[new variables] from combinations of existing variables
    * e.g. `x*z`, `x*w*z`, etc.
    * This is also called "feature engineering" by computer scientists

---

# Data Exploration Steps (2/2)

* .hi[Visualizing] your data
    * e.g. charts, graphs, scatter plots, etc.
* Running .hi[preliminary models]
    * e.g. pairwise correlations, regressions, etc.
* .hi[Selecting subsamples]
    * e.g. throw out baby boomers because millennials are the population of interest

---

# Questions you should ask yourself

Throughout the process of data cleaning/exploration, you should constantly be asking yourself the following questions:

* What is my unit of analysis?
* How am I measuring the phenomena I am attempting to research?
* Do the variables mean what I think they should mean?
* Does my data set have the expected number of observations?
* What fraction of values are missing in my data? Is this what I would have expected?
* Are variables named in such a way that if I were to give this data set to someone else, they would understand the contents of the data?
* Are the variables I am interested in continuous or discrete?
* Does anything about my data look *funny*?

---

# Normalization of (continuous) numerical data

- Often, it will be useful to normalize your data in specific ways

- Why is it advantageous to normalize numerical data?

    1. It often greatly enhances numerical performance

    2. It can help you interpret your model's results down the line (e.g. "a one-unit increase in X corresponds to a one-standard-deviation increase in Y")

---

# Z-score

- This is done by computing `z = (x-mu)/sigma` where `mu` is the sample average and `sigma` is the sample standard deviation

- The resulting variable typically takes on values between -5 and +5, and each unit is interpreted as one standard deviation

- .hi[Example:] In education data, student test scores are often expressed in z-score units to facilitate "apples to apples" comparisons

- In data science applications, many ML algorithms require z-scored data to run properly

    - This is because unscaled data can sometimes cause numerical issues

    - Many ML packages will do this z-scoring for the user as part of the algorithm
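As a minimal sketch in base R (the data frame `df` and variable `x` are hypothetical), you can z-score by hand or with the built-in `scale()` function:

```r
df <- data.frame(x = c(2, 5, 9, 14, 23))            # hypothetical data

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (df$x - mean(df$x)) / sd(df$x)

# equivalently, base R's scale() centers and rescales in one step
z_scaled <- as.numeric(scale(df$x))

all.equal(z_manual, z_scaled)   # TRUE
```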
---

# Max-min normalization

- This is done by normalizing the variable relative to its max and min

- The transformed variable takes on values in [0,1]: `z = (x-min(x))/(max(x)-min(x))`

- The issue with this method is that if your data set has extreme outliers, almost all of the normalized data will be squeezed into a narrow sliver of the [0,1] interval

---

# Sigmoidal normalization

- This is another way to transform a variable to fall in the (0,1) interval

- This method is robust to outliers (they get squashed toward 0 or 1): `z = exp(x)/(1+exp(x))` or, equivalently, `z = 1/(1+exp(-x))`

- It's called .hi[sigmoidal] because the function `f(x) = 1/(1+exp(-x))` is known as the sigmoid (or logistic) function

---

# Hyperbolic tangent normalization

- If one wanted to normalize a variable to fall in the interval (-k,k), one could use the hyperbolic tangent function: `z = k*tanh(x)`

- This is a lot like the sigmoid function, only it maps into (-k,k) instead of (0,1)
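To make the comparison concrete, here is an illustrative sketch of all three normalizations in base R (the data vector `x` and the bound `k` are made up):

```r
x <- c(1, 2, 3, 4, 100)                  # hypothetical data with one big outlier

# max-min normalization: the outlier squeezes the rest toward 0
z_minmax  <- (x - min(x)) / (max(x) - min(x))

# sigmoidal normalization: the outlier is squashed toward 1
z_sigmoid <- 1 / (1 + exp(-x))

# hyperbolic tangent normalization into (-k, k)
k <- 2
z_tanh    <- k * tanh(x)
```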
---

# Discretization

- Sometimes, it's useful to convert a continuous variable into a discrete one

- This can be done in a myriad of ways

- We won't cover these in detail, but think of how a histogram is built

---

# Other transformations

- It may also help to transform variables to fit your specific needs. One common transformation is the `log` transformation: `z = ln(x)`

- If you are examining earnings, sales, or some other nominal currency quantity, these distributions are often approximately log-normal

- If you take the log of a log-normally distributed variable, you get back a normally distributed variable, which is very convenient to work with

- In general, you should think "log-normal" if you see a variable that is:

    1. continuous
    2. always positive
    3. has a lot of observations clustered around a small(ish) value
    4. has a really, really long tail

---

# Imputation

So you've got some missing values in your data. What can you do to fix them?

* Listwise deletion
* Mean (or mode) substitution
* Regression (or k-nearest neighbors) matching
* Multiple imputation
* Bayesian methods
* Create a binary variable indicating missingness and use it as a control in downstream models
* Interpolate missing values (very useful for panel data!)

---

# Why are the values missing?

- The key to choosing an imputation scheme is knowing why the values are missing in the first place

- Are they missing e.g. because users randomly had a glitch that kicked them off the website? Or are they missing because some users didn't feel like clicking "purchase" and instead logged off, leaving the product in their cart?

---

# Types of missing data

There are three types of "missingness":

1. Missing Completely At Random (MCAR)

2. Missing At Random (MAR)

3. Missing Not At Random (MNAR)

---

# Missing Completely At Random (MCAR)

- In MCAR, values are missing for completely random reasons

    - i.e. a weighted coin was flipped which determined whether the value made it into our data set

- Think of this like a randomized experiment

- MCAR almost never happens in real life

---

# Missing At Random (MAR)

- In MAR, values are not missing at random, but we can completely account for their missingness by using variables for which we have information

- This would be like a regression model which satisfies the unconfoundedness assumption

---

# Missing Not At Random (MNAR)

- In MNAR, values are missing for reasons that are _not_ completely random, and which cannot fully be accounted for by other variables for which we have information

- Think: an endogenous choice which is made based on utility maximization

- .hi[Sample selection bias]

    - canonical example: we only observe wages for people who are employed; employment is not randomly assigned

    - newer example: we only observe tweets of Twitter users; Twitter accounts (and usage) are not randomly assigned

---

# Ways to resolve the missing data problem

- Listwise deletion

- Mean (or mode) substitution

- Matching

- Multiple imputation

- Bayesian methods

- Create missing value dummy variables

- Interpolation

---

# Listwise deletion

- If the values are .hi[missing completely at random (MCAR)], then we can simply drop the entire offending row and our final results will not be biased

- The downside is that we will lose statistical power (because we will have fewer observations)

- Also, if lots of variables are MCAR in different ways, this can drastically reduce the number of observations

---

# Mean (or mode) substitution

- Another way to impute missing values is to fill in missings with the mean (if the variable is continuous) or the mode (if the variable is discrete)

- This is not a great way to fix the problem

    - On the one hand, the average (mode) will be unchanged

    - On the other hand, the variance will be severely deflated

---

# Matching

- One can also impute by picking another observation for which the value is not missing, and for which all other variables are as similar as possible

- Essentially, we find a "donor" for the observation that is missing, and say "if the observations look similar enough, we can plug in the donor's non-missing value"

- We can use a number of methods to find the donor: OLS, k-nearest neighbors, random forest, etc.

- This method works great if values are MAR

- It is important to keep in mind that these model-based imputation methods may understate the true variance of the distribution, so it is sometimes helpful to inject noise into the imputed values

---

# Multiple imputation

- Multiple imputation is a multi-step process:

    1. Impute values M different times (via any of the above methods)

    2. With M different data sets, conduct your analysis on each data set

    3. Consolidate the analyses (e.g. by averaging)

- Example: `MICE` (Multiple Imputation by Chained Equations)

- At the end of the day, most multiple imputation methods are variations on "matching," which requires MAR
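Here is a minimal sketch of this workflow using the `mice` R package mentioned above. It uses `nhanes`, a small example data set that ships with the package; the regression is purely illustrative:

```r
library(mice)   # install.packages("mice") if needed

# Step 1: impute m = 5 times via predictive mean matching
imp  <- mice(nhanes, m = 5, method = "pmm", seed = 5253, printFlag = FALSE)

# Step 2: run the same analysis on each of the 5 completed data sets
fits <- with(imp, lm(chl ~ age + bmi))

# Step 3: consolidate (pool) the 5 sets of estimates via Rubin's rules
summary(pool(fits))
```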
---

# Bayesian methods

- Another way to impute is to use Bayesian methods

- The R package `MixedDataImpute` uses a state-of-the-art statistical model to impute missing values

- Its performance exceeds that of MICE and other multiple imputation methods

---

# Create a binary variable for missing values

- My preferred way to impute is to create a new variable that is `1` when the values of `x` are missing, and `0` otherwise

- To create this, I follow these two steps:

    1. Create the variable (e.g. `m_x`) equal to 1 if `x` is missing, 0 otherwise

    2. Recode the missing values of `x` to be some constant value outside the support (e.g. 50)

- The key thing to remember is that `x` now has a mass point at 50, so when computing summary statistics, you should exclude these observations

- The nice thing about this is that you can, for example, run a regression and include `m_x` as a control variable, and it will estimate a separate effect of `x` for the group of observations with missing values
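A minimal sketch of these two steps in base R (the data frame `df`, the variable `x`, and the placeholder value 50 are all illustrative):

```r
df <- data.frame(x = c(3.2, NA, 4.1, NA, 2.8))   # hypothetical data

# Step 1: create m_x = 1 if x is missing, 0 otherwise
df$m_x <- ifelse(is.na(df$x), 1, 0)

# Step 2: recode missing x's to a constant outside the support
df$x[is.na(df$x)] <- 50

# downstream, include m_x as a control, e.g. lm(y ~ x + m_x, data = df);
# exclude the mass point at 50 when computing summary statistics:
mean(df$x[df$m_x == 0])
```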
---

# Interpolation

If you have panel data, you can fill in missing values by interpolating

Example from my research:

* follow individuals over an extended period of time

* in some years (e.g. 2010), individuals don't report their earnings

* but I can see that they are working at the same job in 2009 and 2011, and I see their earnings in those years

* I can .hi[interpolate] their 2010 earnings by taking the average of the 2009 and 2011 earnings

---

# Notes on Interpolation

- Interpolation requires having valid values on each boundary point

- If you don't have this, then it's called *extrapolation*, and it's much more prone to error

- In my example above, if I observed earnings in 2009 and 2010, but not 2011, I would not be comfortable setting 2011 earnings to be the average of 2009 and 2010, because the individual may have gotten a promotion, etc.

---

# Data exploration: descriptive statistics

Descriptive statistics (mean, median, variance, min, max) may seem "boring," but they are the easiest way for you to summarize your data

It's often helpful to generate a summary table that lists the descriptive statistics for each of the variables you are studying (either outcomes or controls), for example:

| Variable  | Obs.    | Mean      | Std. Dev. | Median    | Min       | Max        |
|-----------|---------|-----------|-----------|-----------|-----------|------------|
| Sales     | 15,147  | 13,456.50 | 16,579.81 | 10,500.00 | 501.00    | 165,789.00 |
| Morning   | 363,528 | 0.24      | 0.43      | 0.00      | 0.00      | 1.00       |
| Afternoon | 363,528 | 0.36      | 0.48      | 0.00      | 0.00      | 1.00       |
| Evening   | 363,528 | 0.35      | 0.48      | 0.00      | 0.00      | 1.00       |
| Night     | 363,528 | 0.05      | 0.22      | 0.00      | 0.00      | 1.00       |
| Income    | 363,528 | 96,501.73 | 87,389.19 | 82,000.00 | 23,000.00 | 425,000.00 |
| Age       | 363,528 | 55.87     | 12.77     | 52.00     | 14.00     | 96.00      |
| Female    | 363,528 | 0.42      | 0.49      | 0.00      | 0.00      | 1.00       |

Anything wrong with this table? Anything that jumps out at you?

---

# Data exploration: frequency tables

It's often helpful to compute frequency tables for each of your variables that are discrete (categorical)

This will help you visualize the distribution of these variables

For example:

* Is the distribution of this three-category variable roughly uniform?

* Or are 95% of the values in one of the three categories?

---

# Data exploration: preliminary models

It often helps to run preliminary statistical models to see how your variables covary, how sensitive raw correlations are to the inclusion of additional predictors, etc.

* pairwise correlations (including rank correlations)

* concordance indices (e.g. Jaccard similarity)

* OLS

* Logit, Probit, etc.

We'll talk more about these next week

---

# Data visualization

Commonly used data visualization types that are useful in getting to know your data:

* box-and-whisker plot
* scatter plot
* histogram
* kernel density plot
* [binned scatter plots](https://michaelstepner.com/binscatter/binscatter-StataConference2014.pdf) (aka "binscatter")
* line charts
* maps
* summary tables (tables are a visualization tool!)
* word clouds (for textual data)
* many others

---

# When to use various visuals?

Different visualization tools can be used for different tasks:

* Find outliers
* Examine distribution of values
    - box-and-whisker, histogram, kernel density
* Report correlations
    - scatter plot or binscatter (for two continuous variables)
    - plots by "facet" (for one continuous variable and one categorical variable)
* Examine trends in time series data
    - line charts
* Text trends
    - word clouds
    - word trees

---

# Coding visuals in R: `ggplot2`

See [here](http://r4ds.had.co.nz/data-visualisation.html) for a step-by-step guide on the basics of creating data visuals in R using `ggplot2`.

R package for binscatters: [binscattr](https://github.com/paulgp/binscattr) (still in its infancy)

---

# Directory of visualization types (from Duke library)

[This website](https://guides.library.duke.edu/datavis/vis_types) has a nice summary of the types of visuals that may be helpful in telling your story.

---

# The OG data visualization

.center[
<img width="600" src="http://geoawesomeness.com/wp-content/uploads/2013/03/John-Snows-cholera-map.jpg" alt="Cholera Map">
]

---

# Beautiful visualization

Reddit for beautiful visualizations: [r/dataisbeautiful](https://www.reddit.com/r/dataisbeautiful/)

Its comical counterpart: [r/dataisugly](https://www.reddit.com/r/dataisugly/)

* You don't want to end up on the latter site

---

# Jupyter notebooks (in Python)

Created by Jason DeBacker (U of South Carolina):

* [Static Visualizations](https://github.com/jdebacker/CompEcon_Fall17/blob/master/Visualization/StaticVisualization.ipynb)

* [Dynamic Visualizations](https://github.com/jdebacker/CompEcon_Fall17/blob/master/Visualization/DynamicVisualization.ipynb)

---

# Economic mobility data used by DeBacker

This module of the course covers data visualization tools. We'll learn these tools by studying economic mobility using data from the [Equality of Opportunity Project](http://www.equality-of-opportunity.org).

[Where is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States](https://eml.berkeley.edu/~saez/chetty-friedman-kline-saezQJE14mobility.pdf)

---

# Useful Links

* [Matplotlib guide](http://pbpython.com/effective-matplotlib.html)
* [Overview of Python visualization tools](http://pbpython.com/visualization-tools-1.html)
* [A Dramatic Tour Through Python's Data Visualization Landscape](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/)
* [A Detailed Matplotlib Example](https://www.dataquest.io/blog/making-538-plots/)
* [Matplotlib Gallery](https://matplotlib.org/gallery.html)
* [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)
* [Bokeh Gallery](https://bokeh.pydata.org/en/latest/docs/gallery.html)
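---

# Appendix: a minimal `ggplot2` sketch

As a quick starting point for the guides linked above, here is a minimal, illustrative `ggplot2` example using R's built-in `mtcars` data (the variables and labels are placeholders for your own data):

```r
library(ggplot2)

# histogram: examine the distribution of a continuous variable
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# scatter plot: examine the correlation between two continuous variables,
# with a categorical variable mapped to color
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon", color = "Cylinders")
```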