# Data analysis

February 10, 2023
## Story time

Note:
- My PI likes to use Overleaf, and before every meeting I need to upload all the new analysis results to the website and recompile the document
- My PI asked me to run some complicated regressions. I spent a lot of time and energy getting the code to run. When it finally did, I outputted it into a table and shared it with my PI, who immediately noticed that the number of observations varied a lot from one column to the other
- My PI decided to change the set of control variables for all analyses in a project, and it took me a long time to find all the different lines in all the different scripts where I listed the controls
- My results changed every time I ran a regression
- My PI asked me to do some initial analysis. I created a table and exported it to Excel. They asked me to do some very specific formatting in Excel and I did it, but then they decided to make a lot of changes to the analysis, and for every change I had to spend a lot of time formatting the table again, only for them to decide they wanted to try something else
# Writing analysis code
## When it comes to *coding*, analysis is the easy part

---

## A good analysis script
- Starts with a fresh workspace
- Loads the data created during construction
- Runs a regression, calculates summary statistics, or creates a graph
- Displays or exports results
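A minimal sketch of such a script in R (the file path and variable names are hypothetical):

```r
rm(list = ls())  # start from a fresh workspace

# load the data created during construction
analysis_data <- readRDS("data/constructed/analysis_data.rds")

# run a regression
model <- lm(outcome ~ treatment, data = analysis_data)

# display or export results
summary(model)
```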
Note:
- it starts with a fresh workspace so it does not depend on having run anything in particular before
- note that there is no data processing going on

---

## A good analysis script

- Makes research decisions clear
- Has simple code that allows the reader to focus on the model
- Makes it easy to understand which results are coming from each part of the code
- Is easy to maintain even if research decisions change

---

## Applying good coding practices to analysis code

---

## Maintainability
The DRY rule:

"Every piece of knowledge must have a single, unambiguous, authoritative representation within a system" (Hunt and Thomas, *The Pragmatic Programmer*)
---

## Modularity
Modular programming
is a software design technique that emphasizes separating the functionality of a program into independent, interchangeable modules, such that each contains everything necessary to execute only one aspect of the desired functionality.
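As an illustration, an analysis workflow might be split into functions like these, each handling one aspect of the work (a sketch; all names here are hypothetical):

```r
# each function does one thing and can be changed without touching the others

load_analysis_data <- function(path) {
  readRDS(path)
}

run_model <- function(data, outcome, controls) {
  lm(reformulate(c("treatment", controls), response = outcome), data = data)
}

export_results <- function(model, path) {
  writeLines(capture.output(summary(model)), path)
}
```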
---

## Don't reinvent the wheel
- There are plenty of existing packages for statistical analysis and presentation of results
- Canned functions often go through some sort of review
- They are also usually programmed to handle errors and edge cases
- Don't try to write code from scratch unless there is no other way to implement your analysis
- If you are not 100% confident about a package, test it and compare its results with those from another software package
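For instance, a fixed-effects regression can rely on a reviewed package such as `fixest`, and its output can be checked against another implementation (a sketch; the variable names are hypothetical):

```r
library(fixest)

# fixest absorbs the fixed effect; lm estimates it explicitly
fit_feols <- feols(outcome ~ treatment | village, data = analysis_data)
fit_lm    <- lm(outcome ~ treatment + factor(village), data = analysis_data)

# the treatment coefficients should agree up to numerical precision
coef(fit_feols)["treatment"]
coef(fit_lm)["treatment"]
```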
---

## More opinionated advice

- Define research inputs such as outcomes and controls in one place, and retrieve these inputs when needed (see the sketch below)
- Use functions to simplify and standardize aspects of the code that are not related to the research, such as graphics themes and table formatting
- Use file names and document outlines to connect code to specific results
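A sketch of the first two points, which also puts the DRY rule into practice (all names are hypothetical):

```r
library(ggplot2)

# research inputs defined in a single authoritative place
controls <- c("age", "education", "hh_size")

# every model retrieves the same control set
model_income  <- lm(reformulate(c("treatment", controls), response = "income"),
                    data = analysis_data)
model_savings <- lm(reformulate(c("treatment", controls), response = "savings"),
                    data = analysis_data)

# one function standardizes the non-research look of all figures
project_theme <- function() {
  theme_minimal(base_size = 12) + theme(legend.position = "bottom")
}
```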
# Common issues in data analysis
## Unstable results
- The most common source of unstable results is a random process somewhere in the code
- Remember to set a seed if random processes are present in your code (see the sketch below)
- Software versions can also make a difference here
- In Stata, the order of observations may also matter -- and it should not!
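A minimal sketch of seeding a random process in R (the seed value and data are hypothetical):

```r
set.seed(20230210)  # any fixed integer works; pick one and keep it

# this bootstrap draw is now identical on every run
bootstrap_draw <- analysis_data[sample(nrow(analysis_data), replace = TRUE), ]
```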
---

## Unstable results

- An easy way to find out if your results are unstable is to track changes to them
- Export tables to TeX or csv and track them using git
- GitHub can also track changes to images
- Rerun the analysis after making changes to the data, even if the analysis scripts didn't change, and check whether your results change

---

## Using categorical variables correctly

- Stata does not automatically recognize labeled variables as categorical variables
- To control for different categories of a variable, use factors in R and the `i.` operator in Stata (see the sketch on the next slide)
- Avoid creating a series of dummies to represent each individual category
- If you are not interested in the coefficients of fixed effects, use commands that don't report them automatically

---

## Using interactions correctly
- Both R and Stata have built-in interaction operators
- Using built-in operators instead of manually multiplying variables is one example of how not to reinvent the wheel
- What is the difference between `#` and `##` in Stata?
- In R, apart from the base `*` operator, `fixest` offers additional interaction functionality
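A sketch of both approaches in R, including the factor-based handling of categorical variables from the previous slide (variable names are hypothetical):

```r
library(fixest)

# base R: `*` expands to both main effects plus their interaction
lm(outcome ~ treatment * gender + factor(region), data = analysis_data)

# fixest: i() builds the interaction, and fixed effects after `|`
# are absorbed without being reported
feols(outcome ~ i(gender, treatment) | region, data = analysis_data)
```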
Note: - why not manually calculate interaction variables?
# A workflow for data analysis
## Start simple
- It's helpful to start writing analysis code with the simplest version of a model, even if you don't plan to report these results
- Run a linear model before trying something more complex
- Add only a few covariates at a time
- If working with large data sets, use only a subsample at first (see the sketch below)
- You can even work with simulated data if you don't have the real data yet
- Use this opportunity to test your code and methods
- Make the code more efficient
- Think about the best practices discussed before
- Make sure you understand your modelling choices
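A sketch of both shortcuts (the data, sample size, and variables are hypothetical):

```r
# develop on a small subsample of the real data
set.seed(20230210)
dev_sample <- analysis_data[sample(nrow(analysis_data), 1000), ]

# or simulate data with the structure you expect from construction
simulated_data <- data.frame(
  outcome   = rnorm(500),
  treatment = rbinom(500, 1, 0.5)
)
lm(outcome ~ treatment, data = simulated_data)
```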
---

## Start simple

- For initial and exploratory results, you want to be able to update outputs quickly
- These outputs will typically only be used by people in the research team who are familiar with the project
- Markdown documents are a great solution for this, with the added advantage of being able to see the code
- Don't stress too much about the perfect formatting for every single table and figure
- Iterate on this format of document until you know what results you want to present

---

## But be mindful of details

- When your code runs, you are only half done!
- Spend some time interpreting results and understanding them before sharing them with PIs
- Think critically about the numbers you see:
  - Do they make sense to you?
  - What additional questions can you anticipate based on the results?

---

## Polishing final outputs

- You want to minimize the number of times you will make precise adjustments to the aesthetics of outputs
- But at least one team member needs to learn the nitty-gritty of formatting tables and creating graphs
- Agree on a style before you start formatting tables and figures
- Keep in mind who the audience of the outputs is while doing so
- Remember that tables and figures should be self-standing: labels and detailed notes are important

---

## Writing code for final outputs

- Automate the creation of outputs and final documents as much as possible (see the sketch below)
- Stay away from workflows that rely on copy-pasting
- Use accessible formats to save output (png, svg, tex, csv)
- In Stata: `outreg2`, `esttab`, `graph export`
- In R: `stargazer`, `huxtable`, `kable`, `modelsummary`, `ggsave`
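A sketch of an automated export step in R (the paths and model are hypothetical):

```r
library(modelsummary)
library(ggplot2)

model <- lm(outcome ~ treatment, data = analysis_data)

# write the table straight to TeX so the document rebuilds without copy-pasting
modelsummary(model, output = "results/main_table.tex")

# save the figure in an accessible format
figure <- ggplot(analysis_data, aes(x = treatment, y = outcome)) +
  geom_point()
ggsave("results/main_figure.png", figure, width = 6, height = 4)
```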
# Exercises
## Exercise

- If you haven't yet, choose a reproducibility package from the list to work on
- Try to find out which part of the script creates which exhibit without looking at the README file
- Take a closer look at the analysis code and see if you can come up with suggestions on how to improve it

---

## Exercise

- Choose one of the visualizations in the paper and see how well it does according to [this checklist](https://stephanieevergreen.com/rate-your-visualization/)
- Choose one of the tables in the paper and see how many of the items in [this checklist](https://devinnovationlab.github.io/guides/templates/paper-submission.html#review-results-and-summary-statistics-tables) it satisfies