Data cleaning

## Exercise

For each of the data tables in `data/tidy`, check:

<ul>
                <li>What variables are included?</li>
                <li>What type of data do they contain (continuous, categorical, binary, date)?</li>
                <li class="fragment fade-in">In what format is this information stored?</li>
                <li class="fragment fade-in">Is this the most efficient way to store it?</li>
            </ul>
            
---

## Data cleaning

- Input: a "raw" tidy data set
- Objectives:
  - Understanding the contents of the data
  - Making the data set easy to understand and to use in statistical software
- Output:
  - A data set containing the same information as before, but easier to handle in statistical software
  - Documentation about the contents of the data

---

## Data format

Statistical software usually has a preferred way of handling each type of data

- In R:
  - Categorical variables are stored as factors
  - Binary variables are stored as logicals (booleans)
  - Text variables are stored as characters
  - Numeric variables are stored as integers or numerics
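
As a minimal sketch of the R types above, the base R snippet below builds a small data set and checks the class of each column. The `survey` data frame and its values are invented for illustration.

```r
# Hypothetical survey data, invented for illustration
survey <- data.frame(
  region   = factor(c("North", "South", "North")),  # categorical -> factor
  employed = c(TRUE, FALSE, TRUE),                  # binary -> logical
  comments = c("none", "moved recently", "none"),   # text -> character
  hh_size  = c(4L, 2L, 5L),                         # whole numbers -> integer
  income   = c(1250.5, 980.0, 1430.25),             # continuous -> numeric
  stringsAsFactors = FALSE
)

# Show the class of every column
col_classes <- sapply(survey, class)
print(col_classes)
```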

---

## Data format

Statistical software usually has a preferred way of handling each type of data

- In Stata:
  - Categorical variables are stored as labeled integers
  - Binary variables are stored as 0/1 integers
  - Text variables are stored as strings
  - Numeric variables are stored as bytes, ints, longs, floats, or doubles

---

## Data format

There are many advantages to using data in the software-specific format for each data type:

- Using type-specific functions (for example for dates, text, and factors)
- Pre-defined ways to represent information in graphs
- Pre-defined ways to handle information in tables and regressions
- Optimized data storage
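
To illustrate the first point, here is a small base R sketch with dates. The date strings are invented for illustration. Once text is converted to the `Date` class, date arithmetic and formatting functions become available.

```r
# Hypothetical interview dates stored as text
date_text <- c("2023-01-15", "2023-03-02")

# Convert to R's Date class so that date functions apply
interview_date <- as.Date(date_text, format = "%Y-%m-%d")

# Date arithmetic works on the Date version, not on the text version
days_between <- as.numeric(diff(interview_date))

# Extract the month as a number
interview_month <- as.integer(format(interview_date, "%m"))

print(days_between)
print(interview_month)
```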

---

## Data format
            
Useful tools to identify data formats

- In Stata:
  - `codebook`
- In R:
  - `skim` (from `skimr`)
  - `makeDataReport` (from `dataReporter`)

---

## Categorical variables

Useful tools to change the format of categorical variables

- In R:
  - `factor`: encode a variable, define category labels, order factors
  - `as_factor` (from `forcats`): change variable type from numeric or string to factor
  - `forcats` is a package to handle multiple categorical variable operations
- In Stata:
  - `label values`: add category labels to numeric variables
  - `encode`: turn string variables into labeled integers
  - `iecodebook` (from `iefieldkit`): bulk recode and label variables
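
A minimal sketch of `factor` in base R, assuming a numeric education variable whose codes and labels are invented for illustration:

```r
# Hypothetical education codes as they might arrive from a survey
education_code <- c(2, 1, 3, 2, 1)

# factor(): map codes to labels and declare that the categories are ordered
education <- factor(
  education_code,
  levels  = c(1, 2, 3),
  labels  = c("Primary", "Secondary", "Tertiary"),
  ordered = TRUE
)

# Counts per category, shown in the declared order
print(table(education))

# Comparisons are allowed because the factor is ordered
at_least_secondary <- education >= "Secondary"
print(at_least_secondary)
```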

---

## Missing values

- Primary data often includes survey codes used to indicate non-responses, such as "Declined to answer" or "Don't know"
- Leaving these codes in the data as regular values will bias estimates, so they should be replaced with missing values
- Ideally, however, we want to keep the information that some missing values are different from others
- In Stata, this can be accomplished using *extended* missing values (`.a`, ..., `.z`), which can also be labeled

---

## Missing values

Useful tools for handling missing values

- In Stata:
  - `recode`
  - `mvdecode`
- In R:
  - `na_if` (from `dplyr`)
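
The sketch below shows the same idea in base R, as a stand-in for `na_if`. The codes `-99` and `-88` and the variable names are assumptions made for illustration. Because base R has a single `NA` per type, the reason for non-response is kept in a separate variable here, which is one possible workaround rather than a standard.

```r
# Hypothetical age variable; -99 and -88 are assumed survey codes
age_raw <- c(34, -99, 51, -88, 27)

# Keep the reason for non-response in a separate labeled variable
# (values that are not survey codes become NA in this variable)
age_missing_reason <- factor(
  age_raw,
  levels = c(-99, -88),
  labels = c("Declined to answer", "Don't know")
)

# Replace the survey codes with NA in the analysis variable
age <- age_raw
age[age %in% c(-99, -88)] <- NA

# Compare the mean with and without the survey codes
mean_with_codes <- mean(age_raw)
mean_valid_only <- mean(age, na.rm = TRUE)
print(c(mean_with_codes, mean_valid_only))
```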

---

## Data documentation

**Data dictionary:** a list of the variables in the data, their definitions, types, codes, and additional metadata

- In Stata: `iecodebook` (from `iefieldkit`)
- In R: `dataMeta` and `datadictionary` packages
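
To make the idea concrete, here is a hand-rolled sketch of a data dictionary in base R. It does not use the packages listed above; the data set, the definitions, and the column layout are all invented for illustration.

```r
# Hypothetical data set, invented for illustration
survey <- data.frame(
  region  = factor(c("North", "South", "North")),
  hh_size = c(4L, 2L, 5L)
)

# One row per variable; definitions are typed in by hand
dictionary <- data.frame(
  variable   = names(survey),
  type       = unname(vapply(survey, function(x) class(x)[1], character(1))),
  definition = c("Region of residence", "Number of household members"),
  codes      = unname(vapply(
    survey,
    function(x) if (is.factor(x)) paste(levels(x), collapse = "; ") else "",
    character(1)
  )),
  stringsAsFactors = FALSE
)

print(dictionary)
```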

---

## Data documentation

**Codebook:** a summary of the contents of the data, such as share of missing values, number of unique values, and distribution

- In Stata:
  - `codebook`
  - `iesave` (from `ietoolkit`)
- In R:
  - `skimr`

---

## Tracking changes to data

- As discussed earlier today, we are not allowed to store data on GitHub
- However, storing codebooks is an efficient way to track changes to the data
- It can also be a secure way to do so, as long as no sensitive data (e.g. examples of text values or GPS coordinates) are included

---

## Tracking changes to data

- In Stata, `iesave` has an option called `report` that saves a markdown document with the codebook from a data set
- In R, you can export the output of `skim` to a csv file
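
As a rough base R stand-in for exporting `skim` output, the sketch below builds a small codebook with summary statistics only and writes it to a csv file. The data set, the chosen statistics, and the file name are assumptions made for illustration.

```r
# Hypothetical data set, invented for illustration
survey <- data.frame(
  region = factor(c("North", "South", NA, "North")),
  income = c(1250.5, NA, 1430.25, 980)
)

# One row per variable with summary statistics only, never raw values
codebook <- data.frame(
  variable      = names(survey),
  type          = unname(vapply(survey, function(x) class(x)[1], character(1))),
  share_missing = unname(vapply(survey, function(x) mean(is.na(x)), numeric(1))),
  n_unique      = unname(vapply(
    survey,
    function(x) length(unique(x[!is.na(x)])),
    integer(1)
  )),
  stringsAsFactors = FALSE
)

# Written to a temporary folder here; in a project this would be a tracked path
codebook_path <- file.path(tempdir(), "survey_codebook.csv")
write.csv(codebook, codebook_path, row.names = FALSE)
```

Committing a file like this after each cleaning step makes changes in missingness or in the number of categories visible in the diff, without exposing the data itself.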