class: center, middle, inverse, title-slide .title[ # Descriptive Analysis ] .subtitle[ ##
R for Stata Users
] .author[ ### Luiza Andrade, Marc-Andrea Fiorina, Rob Marty, Maria Reyes Retana, Rony Rodriguez Ramirez, Luis Eduardo San Martin, Leonardo Viotti ] .date[ ### The World Bank |
WB Github
March 2024 ] --- # Introduction ### Initial Setup .panelset[ .panel[.panel-name[If You Attended Session 2] 1. Go to the `dime-r-training-mar2024` folder that you created yesterday, and open the `dime-r-training-mar2024` R project that you created there. ] .panel[.panel-name[If You Did Not Attend Session 2] 1. Open RStudio. 2. Type in the following lines, replacing “YOURFILEPATHHERE” with a file path where the file path where you will place this R project. ```r install.packages("usethis") library(usethis) usethis::use_course( "https://github.com/worldbank/dime-r-training/archive/main.zip", destdir = "YOURFILEPATHHERE" ) ``` 3\. In the console, type in the requisite number to delete the .zip file (we don't need it anymore) 4\. A new RStudio environment will open. Use this for the session today. ] ] --- # Table of contents .vlarge[ 1. [Quick summary statistics](#exploring) 2. [Descriptive statistics tables](#desc_tables) 3. [Exporting descriptive statistics tables](#exporting) 3. [Formatting tables](#beautifying) 3. [Aggregating observations](#aggregating) 3. [Running regressions](#regressing) 3. [Exporting regression tables](#reg_tables) ] ??? I'm here to talk to you about creating tables in R. Although in terms of data analysis this is very similar to data visualization, in the sense that what we are doing is trying to display information about the data in the most concise and informative manner, the tools and packages required to implement the two are very different. Which is why we separated them into two sessions. I will tell you exactly what are we going to do for the next hour and a half in a little bit, but first, let me ask you all something: what software do you usually export your tables to? The objection of this session is to show you have to do 4 things: - print quick statistics to explore your data - export summary statistics tables in the most reproducible way possible - run simple regressions - export regression tables I think most of you here know me already and have heard my spiel before, but since that's what I do, let's take a look at what I mean by reproducible --- # Workflows for outputs ## .red[Not reproducible] Anything that requires
Copy-pasting
Manual formatting after exported -- ## .green[Reproducible]
R Markdown: dynamic document containing code and text that is exported directly from R into PDF, HTML, Word, Power Point and other formats
LaTeX: typesetting system used for scientific publications that automatically reloads tables and figures every time the document is rendered ??? What is NOT reproducible? Anything that requires manual steps to update results in your final document after you update the data or the exact specification. This includes the terrible practice of printing results in the console and pasting them into Word, but also the much less terrible practice of exporting results to Excel and then manually formatting them and copying into Word. Can someone tell me why these are not ideal practices? The two best options to combine with R in terms of reproducibility are Markdown and LaTeX. Markdown is R's dyamic document framework and it's amazingly well developed. Most R advanced R users actually use Markdown to display their results instead of exporting tables and figures. I'm going to show you what that looks like, but this is a slightly more advanced topic that will not be covered on this course. LaTeX, on the other hand, is widely used among non-R users, and there are plenty of packages to export tables to it in Stata as well. But that's enough of me talking. Let's get you all to run some code --- # Setting the stage Load the packages that we will use today ```r # Install new packages install.packages("modelsummary") install.packages("fixest") install.packages("janitor") install.packages("huxtable") install.packages("openxlsx") ``` ```r # Load packages library(here) library(tidyverse) library(modelsummary) library(fixest) library(janitor) library(huxtable) library(openxlsx) ``` --- # Setting the stage Load the data that we will use today: Stata's `census` dataset **Tip**: Use `here`, as we saw in the data wrangling session.
−
+
02
:
00
-- ```r # Load data census <- read_rds( here( "DataWork", "DataSets", "Final", "census.rds" ) ) ``` ??? So first thing, as usual, is make sure you are setting your folder paths so R knows where to find files and where to export them too. If you have already downloaded the DataWork folder from OSF, all you need to do now is edit this line of code to match your computer. I'll do it on mine in case you don't remember the exact steps. Then we will load the packages for today: tidyverse, as usual, and two new packages: modelsummary and huxtable, which we will learn about soon. Finally, let's load some data. This dataset is probably already familiar to most of you: it's Stata's built-in 1980 census data at state level. Double-check that you can see this dataset in your environment pane. Now, can someone tell me something about this dataset and how to get a little bit of information about it? We discussed it in the data wrangling session. --- # Taking a peek at the data ```r glimpse(census) ``` ``` ## Rows: 50 ## Columns: 13 ## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Co… ## $ state2 <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "… ## $ region <fct> South, West, West, South, West, West, NE, South, South, South… ## $ pop <int> 3893888, 401851, 2718215, 2286435, 23667902, 2889964, 3107576… ## $ poplt5 <int> 296412, 38949, 213883, 175592, 1708400, 216495, 185188, 41151… ## $ pop5_17 <int> 865836, 91796, 577604, 495782, 4680558, 592318, 637731, 12544… ## $ pop18p <int> 2731640, 271106, 1926728, 1615061, 17278944, 2081151, 2284657… ## $ pop65p <int> 440015, 11547, 307362, 312477, 2414250, 247325, 364864, 59179… ## $ popurban <int> 2337713, 258567, 2278728, 1179556, 21607606, 2329869, 2449774… ## $ medage <dbl> 29.3, 26.1, 29.2, 30.6, 29.9, 28.6, 32.0, 29.8, 34.7, 28.7, 2… ## $ death <int> 35305, 1604, 21226, 22676, 186428, 18925, 26005, 5123, 104190… ## $ marriage <int> 49018, 5361, 30223, 26513, 210864, 34917, 26048, 4437, 108344… ## $ divorce <int> 26745, 3517, 19908, 15882, 133541, 18571, 13488, 2313, 71579,… ``` --- class: inverse, center, middle name: exploring # Quick summary statistics <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Exploring a dataset .command[ ## `summary(x, digits)` Equivalent to Stata's `codebook`. Its arguments are: * **x:** the object you want to summarize, usually a vector or data frame * *digits:* the number of decimal digits to be displayed ] -- .exercise[ ### Exercise 1
Use the `summary()` function to describe the `census` data frame. ]
−
+
00
:
45
--- # Exploring a dataset .small[ ```r summary(census) ``` ``` ## state state2 region pop ## Length:50 Length:50 NE : 9 Min. : 401851 ## Class :character Class :character N Cntrl:12 1st Qu.: 1169218 ## Mode :character Mode :character South :16 Median : 3066433 ## West :13 Mean : 4518149 ## 3rd Qu.: 5434033 ## Max. :23667902 ## poplt5 pop5_17 pop18p pop65p ## Min. : 35998 Min. : 91796 Min. : 271106 Min. : 11547 ## 1st Qu.: 98831 1st Qu.: 257949 1st Qu.: 823702 1st Qu.: 118660 ## Median : 227468 Median : 629654 Median : 2175130 Median : 370495 ## Mean : 326278 Mean : 945952 Mean : 3245920 Mean : 509503 ## 3rd Qu.: 361321 3rd Qu.:1143292 3rd Qu.: 3858173 3rd Qu.: 580087 ## Max. :1708400 Max. :4680558 Max. :17278944 Max. :2414250 ## popurban medage death marriage ## Min. : 172735 Min. :24.20 Min. : 1604 Min. : 4437 ## 1st Qu.: 826651 1st Qu.:28.73 1st Qu.: 9087 1st Qu.: 14840 ## Median : 2156905 Median :29.75 Median : 26177 Median : 36279 ## Mean : 3328253 Mean :29.54 Mean : 39474 Mean : 47701 ## 3rd Qu.: 3403450 3rd Qu.:30.20 3rd Qu.: 46533 3rd Qu.: 57338 ## Max. :21607606 Max. :34.70 Max. :186428 Max. :210864 ## divorce ## Min. : 2142 ## 1st Qu.: 6898 ## Median : 17113 ## Mean : 23679 ## 3rd Qu.: 27987 ## Max. :133541 ``` ] --- # Exploring a dataset <iframe src="https://rrmaximiliano.shinyapps.io/learnr-app/?showcase=0" width="100%" height="400px" data-external="1"></iframe> --- # Summarizing continuous variables .large[ - `summary()` can also be used with a single variable. - When used with continuous variables, it works similarly to `summarize` in Stata. - When used with categorical variables, it works similarly to `tabulate`. ] --- # Summarizing continuous variables .exercise[ ### Exercise 2
Use the `summary()` function to display summary statistics for a continuous variable in the `census` data frame. ]
−
+
00
:
45
??? Note that we have already summarized continuous and categorical variables when summarizing the entire data frame. But this is a reminder of how to select a single column inside a data frame. So choose are continuous variables you can find and summarize only that variable. --- # Summarizing continuous variables .exercise[ ### Exercise 2
Use the `summary()` function to display summary statistics for a continuous variable in the `census` data frame. ] ```r summary(census$pop) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 401851 1169218 3066433 4518149 5434033 23667902 ``` --- # Summarizing continuous variables <iframe src="https://rrmaximiliano.shinyapps.io/learnr-app/?showcase=0" width="100%" height="400px" data-external="1"></iframe> --- # Summarizing categorical variables .command[ ## `tabyl(x, ...)` Equivalent to `tabulate` in Stata, creates a frequency table. Its main arguments are vectors to be tabulated. * **x:** the object you want to summarize, usually a vector or data frame * *...* additional options as show_na, or show_missing_levels. ] -- .exercise[ ### Exercise 3
Use the `tabyl()` function to display frequency tables for: 1. The variable `region` in the `census` data frame 2. The variables `region` and `state` in the `census` data frame, simultaneously ]
−
+
01
:
00
--- # Summarizing categorical variables ## One way tabulation ```r census %>% tabyl(region) ```
region
n
percent
NE
9
0.18
N Cntrl
12
0.24
South
16
0.32
West
13
0.26
??? Note that this and other tables that we will create during this session look more polished in the presentation than when you print them to the console. That's because the commands have a pre-defined printing option for RMArkdown, which was used to create this presentation --- # Summarizing categorical variables ## Two way tabulation ```r census %>% tabyl(state, region) ```
state
NE
N Cntrl
South
West
Alabama
0
0
1
0
Alaska
0
0
0
1
Arizona
0
0
0
1
Arkansas
0
0
1
0
California
0
0
0
1
Colorado
0
0
0
1
Connecticut
1
0
0
0
??? Note that the output of `table` having a single variable as argument is the same as that of `summary`. However, `table` can also do two-way tabulations --- class: inverse, center, middle name: desc_tables # Descriptives tables <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Descriptives tables ## What if you want to... - ...export the summary statistics to another software? - ...customize which statistics to display? - ...format the table? -- ## Well, then you will need a few more packages - There are many packages that can be used both for displaying and exporting summary statistics - Today we will show you a combination of two packages: `modelsummary` and `huxtable` - We chose this combination because together, they can perform all the tasks we are interested in - In fact, `modelsummary` can perform most of them by itself -- with the exception of exporting formatted tables to Excel --- # Exploring datasets with *modelsummary* .large[ The package *modelsummary* contains a family of functions called `datasummary` which can be used to create different types of summary statistics tables. These include: - `datasummary_skim`, to create descriptive statistics tables - `datasummary_balance`, to create balance tables - `datasummary_correlation`, to create a correlation table - `datasummary_crosstab`, to create a twoway tabulation - `datasummary`, to create customized descriptive statistics tables ] --- # Exploring datasets with *modelsummary* .command[ ## `datasummary_skim(data, output, ....)` * **data:** the data set to be summarized, the only required argument * **output:** the type of output desired * *...:* additional options allow for formatting customization, such as including notes and titles ] ```r datasummary_skim( data, type = "numeric", output = "default", histogram = TRUE, title = NULL, notes = NULL, ... ) ``` --- # Exploring datasets with *modelsummary* .large[ .exercise[ ### Exercise 4
Use `datasummary_skim()` to create a descriptive statistics table for the `census` data. ] ]
−
+
00
:
45
--- # Exploring datasets with *modelsummary* ```r datasummary_skim(census) ``` .small[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Unique (#) </th> <th style="text-align:right;"> Missing (%) </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Max </th> <th style="text-align:right;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> pop </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4518149.4 </td> <td style="text-align:right;"> 4715037.8 </td> <td style="text-align:right;"> 401851.0 </td> <td style="text-align:right;"> 3066433.0 </td> <td style="text-align:right;"> 23667902.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.01" y="3.22" width="9.55" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.56" y="9.79" width="9.55" height="1.88" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.11" y="10.72" width="9.55" height="0.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.66" y="11.43" width="9.55" height="0.23" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.22" y="11.43" width="9.55" height="0.23" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> poplt5 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 326277.8 </td> <td style="text-align:right;"> 331585.1 </td> <td style="text-align:right;"> 35998.0 </td> <td style="text-align:right;"> 227467.5 </td> <td style="text-align:right;"> 1708400.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.82" y="3.22" width="5.32" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.14" y="4.83" width="5.32" height="6.84" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.45" y="9.65" width="5.32" height="2.01" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.77" y="10.46" width="5.32" height="1.21" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.08" y="11.26" width="5.32" height="0.40" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.40" y="10.86" width="5.32" height="0.80" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="32.71" y="11.66" width="5.32" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.03" y="11.66" width="5.32" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.34" y="11.26" width="5.32" height="0.40" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> pop5_17 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 945951.6 </td> <td style="text-align:right;"> 959372.8 </td> <td style="text-align:right;"> 91796.0 </td> <td style="text-align:right;"> 629654.0 </td> <td style="text-align:right;"> 4680558.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.89" y="3.22" width="4.84" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.73" y="5.00" width="4.84" height="6.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.57" y="8.55" width="4.84" height="3.11" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.42" y="10.77" width="4.84" height="0.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.26" y="9.88" width="4.84" height="1.78" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.10" y="11.66" width="4.84" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.95" y="11.22" width="4.84" height="0.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.79" y="11.22" width="4.84" height="0.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.63" y="11.66" width="4.84" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.47" y="11.22" width="4.84" height="0.44" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> pop18p </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3245920.1 </td> <td style="text-align:right;"> 3430531.3 </td> <td style="text-align:right;"> 271106.0 </td> <td style="text-align:right;"> 2175130.0 </td> <td style="text-align:right;"> 17278944.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.07" y="3.22" width="5.23" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.30" y="5.14" width="5.23" height="6.53" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.52" y="10.51" width="5.23" height="1.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.75" y="10.51" width="5.23" height="1.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.97" y="10.51" width="5.23" height="1.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.20" y="11.66" width="5.23" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="32.43" y="11.28" width="5.23" height="0.38" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.65" y="11.66" width="5.23" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.88" y="11.28" width="5.23" height="0.38" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> pop65p </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 509502.8 </td> <td style="text-align:right;"> 538932.4 </td> <td style="text-align:right;"> 11547.0 </td> <td style="text-align:right;"> 370495.0 </td> <td style="text-align:right;"> 2414250.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.56" y="3.22" width="9.25" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="10.81" y="9.10" width="9.25" height="2.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.06" y="10.89" width="9.25" height="0.77" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="29.31" y="11.15" width="9.25" height="0.51" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.56" y="11.15" width="9.25" height="0.51" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> popurban </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3328253.2 </td> <td style="text-align:right;"> 4090177.9 </td> <td style="text-align:right;"> 172735.0 </td> <td style="text-align:right;"> 2156905.0 </td> <td style="text-align:right;"> 21607606.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.42" y="3.22" width="10.37" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.79" y="10.43" width="10.37" height="1.24" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.15" y="11.25" width="10.37" height="0.41" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="32.52" y="11.66" width="10.37" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.89" y="11.46" width="10.37" height="0.21" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> medage </td> <td style="text-align:right;"> 37 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 29.5 </td> <td style="text-align:right;"> 1.7 </td> <td style="text-align:right;"> 24.2 </td> <td style="text-align:right;"> 29.8 </td> <td style="text-align:right;"> 34.7 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.93" y="11.32" width="8.47" height="0.34" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.40" y="9.64" width="8.47" height="2.03" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.86" y="3.22" width="8.47" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.33" y="6.60" width="8.47" height="5.07" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.79" y="10.99" width="8.47" height="0.68" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.26" y="11.32" width="8.47" height="0.34" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> death </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 39474.3 </td> <td style="text-align:right;"> 41742.3 </td> <td style="text-align:right;"> 1604.0 </td> <td style="text-align:right;"> 26176.5 </td> <td style="text-align:right;"> 186428.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.39" y="3.22" width="4.81" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.20" y="4.16" width="4.81" height="7.51" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.01" y="8.38" width="4.81" height="3.28" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.82" y="10.72" width="4.81" height="0.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.63" y="11.19" width="4.81" height="0.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="25.44" y="10.25" width="4.81" height="1.41" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.25" y="11.19" width="4.81" height="0.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="35.06" y="11.66" width="4.81" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="39.87" y="11.19" width="4.81" height="0.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.68" y="11.19" width="4.81" height="0.47" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> marriage </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 47701.4 </td> <td style="text-align:right;"> 45130.4 </td> <td style="text-align:right;"> 4437.0 </td> <td style="text-align:right;"> 36279.0 </td> <td style="text-align:right;"> 210864.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.82" y="3.22" width="10.77" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.59" y="9.18" width="10.77" height="2.48" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="22.35" y="10.67" width="10.77" height="0.99" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.12" y="11.41" width="10.77" height="0.25" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.88" y="11.41" width="10.77" height="0.25" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> <tr> <td style="text-align:left;"> divorce </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 23679.4 </td> <td style="text-align:right;"> 25094.0 </td> <td style="text-align:right;"> 2142.0 </td> <td style="text-align:right;"> 17112.5 </td> <td style="text-align:right;"> 133541.0 </td> <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { fill: none; stroke: #000000; stroke-linecap: round; stroke-linejoin: round; stroke-miterlimit: 10.00; } .svglite text { white-space: pre; } </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> </g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.05" y="3.22" width="6.76" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="7.82" y="9.02" width="6.76" height="2.64" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.58" y="10.61" width="6.76" height="1.06" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.35" y="11.13" width="6.76" height="0.53" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="28.11" y="11.40" width="6.76" height="0.26" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.88" y="11.66" width="6.76" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="41.64" y="11.40" width="6.76" height="0.26" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> </td> </tr> </tbody> </table> ] --- # Exploring datasets with *modelsummary* <iframe src="https://rrmaximiliano.shinyapps.io/learnr-app/?showcase=0" width="100%" height="400px" data-external="1"></iframe> --- # Exploring datasets with *modelsummary* .large[ - *modelsummary* summarizes only numeric variables by default. - To summarize categorical variables, use the argument `type` ```r datasummary_skim(census, type = "categorical") ``` ] <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> data </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> % </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> NE </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 18.0 </td> </tr> <tr> <td style="text-align:left;"> N Cntrl </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 24.0 </td> </tr> <tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 32.0 </td> </tr> <tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 26.0 </td> </tr> </tbody> </table> --- # Exploring datasets with *modelsummary* You can also customize the variables and statistics to include using a **formula** with the `datasummary()` function. .command[ ## `datasummary(formula, data, output, ...)` * **formula:** a two-sided formula to describe the table: rows ~ columns * **data:** the data set to be summarized * *output:* the type of output desired * *...:* additional options allow for formatting customization ] ```r datasummary( var1 + var2 + var3 ~ stat1 + stat2 + stat3 + stat4, data = data ) ``` --- # Exploring datasets with *modelsummary* .exercise[ ### Exercise 5
Create a table showing the number of observations, mean, standard deviation, minimum, maximum and median value for all the population, number of deaths, number of marriage and number of divorces in the `census` data. ] .can-edit[ ```r datasummary( var1 + var2 + var3 ~ stat1 + stat2 + stat3 + stat4, data = data ) ``` ] > **Tip:** some of the allowed statistics are N, Mean, SD, Min, Max, Median, P0, P25, P50, P75, P100, Histogram
−
+
01
:
30
--- # Exploring datasets with *modelsummary* .large[ ```r datasummary( pop + death + marriage + divorce ~ N + Mean + SD + Median + Min + Max, data = census ) ``` ] <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> pop </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 4518149.44 </td> <td style="text-align:right;"> 4715037.75 </td> <td style="text-align:right;"> 3066433.00 </td> <td style="text-align:right;"> 401851.00 </td> <td style="text-align:right;"> 23667902.00 </td> </tr> <tr> <td style="text-align:left;"> death </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 39474.26 </td> <td style="text-align:right;"> 41742.35 </td> <td style="text-align:right;"> 26176.50 </td> <td style="text-align:right;"> 1604.00 </td> <td style="text-align:right;"> 186428.00 </td> </tr> <tr> <td style="text-align:left;"> marriage </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 47701.40 </td> <td style="text-align:right;"> 45130.42 </td> <td style="text-align:right;"> 36279.00 </td> <td style="text-align:right;"> 4437.00 </td> <td style="text-align:right;"> 210864.00 </td> </tr> <tr> <td style="text-align:left;"> divorce </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 23679.44 </td> <td style="text-align:right;"> 25094.01 </td> <td style="text-align:right;"> 17112.50 </td> <td style="text-align:right;"> 2142.00 </td> <td style="text-align:right;"> 133541.00 </td> </tr> </tbody> </table> --- # Exploring datasets with *modelsummary* .pull-left[ ```r datasummary( All(census) ~ N + Mean + SD + Median + Min + Max, data = census ) ``` ] .pull-right[ .small[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> N </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> <th style="text-align:right;"> Median </th> <th style="text-align:right;"> Min </th> <th style="text-align:right;"> Max </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> pop </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 4518149.44 </td> <td style="text-align:right;"> 4715037.75 </td> <td style="text-align:right;"> 3066433.00 </td> <td style="text-align:right;"> 401851.00 </td> <td style="text-align:right;"> 23667902.00 </td> </tr> <tr> <td style="text-align:left;"> poplt5 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 326277.78 </td> <td style="text-align:right;"> 331585.14 </td> <td style="text-align:right;"> 227467.50 </td> <td style="text-align:right;"> 35998.00 </td> <td style="text-align:right;"> 1708400.00 </td> </tr> <tr> <td style="text-align:left;"> pop5_17 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 945951.60 </td> <td style="text-align:right;"> 959372.83 </td> <td style="text-align:right;"> 629654.00 </td> <td style="text-align:right;"> 91796.00 </td> <td style="text-align:right;"> 4680558.00 </td> </tr> <tr> <td style="text-align:left;"> pop18p </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 3245920.06 </td> <td style="text-align:right;"> 3430531.31 </td> <td style="text-align:right;"> 2175130.00 </td> <td style="text-align:right;"> 271106.00 </td> <td style="text-align:right;"> 17278944.00 </td> </tr> <tr> <td style="text-align:left;"> pop65p </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 509502.80 </td> <td style="text-align:right;"> 538932.38 </td> <td style="text-align:right;"> 370495.00 </td> <td style="text-align:right;"> 11547.00 </td> <td style="text-align:right;"> 2414250.00 </td> </tr> <tr> <td style="text-align:left;"> popurban </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 3328253.18 </td> <td style="text-align:right;"> 4090177.93 </td> <td style="text-align:right;"> 2156905.00 </td> <td style="text-align:right;"> 172735.00 </td> <td style="text-align:right;"> 21607606.00 </td> </tr> <tr> <td style="text-align:left;"> medage </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 29.54 </td> <td style="text-align:right;"> 1.69 </td> <td style="text-align:right;"> 29.75 </td> <td style="text-align:right;"> 24.20 </td> <td style="text-align:right;"> 34.70 </td> </tr> <tr> <td style="text-align:left;"> death </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 39474.26 </td> <td style="text-align:right;"> 41742.35 </td> <td style="text-align:right;"> 26176.50 </td> <td style="text-align:right;"> 1604.00 </td> <td style="text-align:right;"> 186428.00 </td> </tr> <tr> <td style="text-align:left;"> marriage </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 47701.40 </td> <td style="text-align:right;"> 45130.42 </td> <td style="text-align:right;"> 36279.00 </td> <td style="text-align:right;"> 4437.00 </td> <td style="text-align:right;"> 210864.00 </td> </tr> <tr> <td style="text-align:left;"> divorce </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 23679.44 </td> <td style="text-align:right;"> 25094.01 </td> <td style="text-align:right;"> 17112.50 </td> <td style="text-align:right;"> 2142.00 </td> <td style="text-align:right;"> 133541.00 </td> </tr> </tbody> </table> ] ] --- # Balance tables with *modelsummary* ```r census_rct <- census %>% mutate( treatment = as.numeric(runif(n()) > 0.5) ) %>% select( -starts_with("state") ) datasummary_balance( ~ treatment, data = census_rct ) ``` --- # Balance tables with *modelsummary* .small[ <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="2"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">0</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">1</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="2"></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Std. Dev. </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> Std. Dev. </th> <th style="text-align:right;"> Diff. in Means </th> <th style="text-align:right;"> Std. Error </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> pop </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 6127525.7 </td> <td style="text-align:right;"> 6824185.9 </td> <td style="text-align:right;"> 3828416.7 </td> <td style="text-align:right;"> 3351348.8 </td> <td style="text-align:right;"> -2299109.0 </td> <td style="text-align:right;"> 1850820.2 </td> </tr> <tr> <td style="text-align:left;"> poplt5 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 421233.2 </td> <td style="text-align:right;"> 474619.5 </td> <td style="text-align:right;"> 285582.6 </td> <td style="text-align:right;"> 244984.0 </td> <td style="text-align:right;"> -135650.6 </td> <td style="text-align:right;"> 129353.6 </td> </tr> <tr> <td style="text-align:left;"> pop5_17 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 1241312.6 </td> <td style="text-align:right;"> 1354292.9 </td> <td style="text-align:right;"> 819368.3 </td> <td style="text-align:right;"> 718505.4 </td> <td style="text-align:right;"> -421944.3 </td> <td style="text-align:right;"> 370167.5 </td> </tr> <tr> <td style="text-align:left;"> pop18p </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 4464979.9 </td> <td style="text-align:right;"> 5000055.2 </td> <td style="text-align:right;"> 2723465.8 </td> <td style="text-align:right;"> 2393654.8 </td> <td style="text-align:right;"> -1741514.1 </td> <td style="text-align:right;"> 1352924.9 </td> </tr> <tr> <td style="text-align:left;"> pop65p </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 710303.4 </td> <td style="text-align:right;"> 793981.0 </td> <td style="text-align:right;"> 423445.4 </td> <td style="text-align:right;"> 365507.6 </td> <td style="text-align:right;"> -286858.0 </td> <td style="text-align:right;"> 214112.3 </td> </tr> <tr> <td style="text-align:left;"> popurban </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 5063526.5 </td> <td style="text-align:right;"> 6138886.0 </td> <td style="text-align:right;"> 2584564.6 </td> <td style="text-align:right;"> 2587414.0 </td> <td style="text-align:right;"> -2478961.8 </td> <td style="text-align:right;"> 1644284.7 </td> </tr> <tr> <td style="text-align:left;"> medage </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 30.0 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 29.3 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> -0.7 </td> <td style="text-align:right;"> 0.6 </td> </tr> <tr> <td style="text-align:left;"> death </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 53856.7 </td> <td style="text-align:right;"> 60274.9 </td> <td style="text-align:right;"> 33310.3 </td> <td style="text-align:right;"> 29745.6 </td> <td style="text-align:right;"> -20546.4 </td> <td style="text-align:right;"> 16354.9 </td> </tr> <tr> <td style="text-align:left;"> marriage </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 65459.0 </td> <td style="text-align:right;"> 59965.6 </td> <td style="text-align:right;"> 40091.0 </td> <td style="text-align:right;"> 35439.1 </td> <td style="text-align:right;"> -25368.0 </td> <td style="text-align:right;"> 16601.5 </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> divorce </td> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:right;box-shadow: 0px 1.5px"> 32523.5 </td> <td style="text-align:right;box-shadow: 0px 1.5px"> 35141.1 </td> <td style="text-align:right;box-shadow: 0px 1.5px"> 19889.1 </td> <td style="text-align:right;box-shadow: 0px 1.5px"> 18701.3 </td> <td style="text-align:right;box-shadow: 0px 1.5px"> -12634.4 </td> <td style="text-align:right;box-shadow: 0px 1.5px"> 9608.3 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> N </td> <td style="text-align:right;"> Pct. </td> <td style="text-align:right;"> N </td> <td style="text-align:right;"> Pct. </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> region </td> <td style="text-align:left;"> NE </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 26.7 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 14.3 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> N Cntrl </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 13.3 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 28.6 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 20.0 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 37.1 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 40.0 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 20.0 </td> <td style="text-align:right;"> </td> <td style="text-align:right;"> </td> </tr> </tbody> </table> ] --- class: inverse, center, middle name: exporting # Exporting tables <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Exporting *modelsummary* table to LaTeX To export the tables we created, we can simply use the option `output`: ```r descriptives <- All(census) ~ N + Mean + SD + Median + Min + Max datasummary( descriptives, data = census, * output = here( # file path to output file * "DataWork", * "Output", * "Raw", * "summary-stats.tex" * ) ) ``` --- # Exporting *modelsummary* table .large[ Other valid output formats include: * `.docx` * `.pptx` * `.html` * `.md` ] --- # Exporting *modelsummary* table .large[ Other valid output formats include: * `.docx` * `.pptx` * `.html` * `.md` * ... but not `.xls` ] --- # Exporting *modelsummary* table to Excel .pull-left[ .large[ - To export the table to Excel, we will first convert it into an object of type *huxtable* - `huxtable` is another R package, one that allows not only for exporting tables, but also for extensive customization - Before getting to the customization part, however, let's export this table: ] ] .pull-right[ ```r # Create the huxtable object summary_stats_table <- datasummary( descriptives, data = census, output = "huxtable" ) # Export it to Excel quick_xlsx( summary_stats_table, # object to be exported file = here( # file path to output file "DataWork", "Output", "Raw", "summary-stats.xlsx" ) ) ``` ] --- # Exporting tables A similar code can also export the same table to a self-standing LaTeX document ```r # Export to LaTeX quick_latex( summary_stats_table, file = here( "DataWork", "Output", "Raw", "summary-stats.tex" ) ) ``` --- # Exporting tables to different Excel tabs .small[ ```r # Start a new workbook wb <- createWorkbook() # Add one sheet to it wb <- as_Workbook( summary_stats_table, Workbook = wb, sheet = "Summary stats" ) # Add another sheet to it wb <- as_Workbook( hux("Mock", "table"), Workbook = wb, sheet = "Other sheet" ) # Save the workbook saveWorkbook( wb, # object to be saved file = here( # file path to output file "DataWork", "Output", "Raw", "summary-stats.xlsx" ), overwrite = TRUE # replace existing file ) ``` ] --- # Exporting tables to LaTeX fragment ```r summary_stats_table %>% print_latex() %>% # See LaTeX code # Save LaTeX code capture.output( file = here( "DataWork", "Output", "Raw", "summary-stats.tex" ) ) ``` You will also need to load the required LaTeX packages. To copy the code that creates a preamble with all of them, use this code: ```r report_latex_dependencies() ``` --- class: inverse, center, middle name: beautifying # Formatting tables <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Beautifying tables - `huxtable` also allows you to customize table formatting so it can be exported with the same layout to multiple software - Before we do that, however, we will create a version of the data where the variable names are the Stata labels ```r # Extract variable labels from data frame labels <- names(census) names(labels) <- attributes(census)$var.labels # Rename the variables census_labelled <- census %>% rename( labels ) # Create a labelled summary table summary_stats_table <- datasummary( All(census_labelled) ~ N + Mean + SD + Median + Min + Max, data = census_labelled, output = "huxtable" ) ``` --- # Beautifying tables The code below shows the table `summary_stats_table` can be formatted .pull-left[ ```r # Format table summary_stats_table %>% # Use first row as table header set_header_rows(1, TRUE) %>% # Use first column as row header set_header_cols(1, TRUE) %>% # Don't round large numbers set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% # Centralize cells in first row set_align(1, everywhere, "center") %>% # Set a theme for quick formatting theme_basic() ``` ] .pull-right[ .small[
N
Mean
SD
Median
Min
Max
Population
50
4518149
4715038
3066433
401851
23667902
Pop, < 5 year
50
326278
331585
227468
35998
1708400
Pop, 5 to 17 years
50
945952
959373
629654
91796
4680558
Pop, 18 and older
50
3245920
3430531
2175130
271106
17278944
Pop, 65 and older
50
509503
538932
370495
11547
2414250
Urban population
50
3328253
4090178
2156905
172735
21607606
Median age
50
30
2
30
24
35
Number of deaths
50
39474
41742
26177
1604
186428
Number of marriages
50
47701
45130
36279
4437
210864
Number of divorces
50
23679
25094
17113
2142
133541
] ] --- # Export beautified tables ```r # Format table summary_stats_table <- summary_stats_table %>% set_header_rows(1, TRUE) %>% # Use first row as table header set_header_cols(1, TRUE) %>% # Use first column as row header set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% # Don't round large numbers set_align(1, everywhere, "center") %>% # Centralize cells in first row theme_basic() # Set a theme for quick formatting quick_xlsx( summary_stats_table, file = here( "DataWork", "Output", "Raw", "summary-stats-basic.xlsx" ) ) ``` --- # Export beautified tables .pull-left[ ## .red[Before]  ] .pull-right[ ## .green[After]  ] --- # Other themes to play with {width=90%} --- class: inverse, center, middle name: regressing # Ok, can we run some regressions now?! <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Running regressions The base R command for linear regressions is called `lm` .command[ ## lm(formula, data, subset, weights, ...) - **formula:** an object of class "formula" containing a symbolic description of the model - **data:** a data frame containing the variables indicated in the formula - *subset:* an optional vector specifying a subset of observations to be used in the regression - *weights:* an optional vector of weights to be used in the regression ] Formulas can take three specifications: - `y ~ x1 + x2` regresses variable `y` on covariates `x1` and `x2` - `y ~ x1:x2` regresses variable `y` on the interaction of covariates `x1` and `x2` - `y ~ x1*x2` is equivalent to `y ~ x1 + x2 + x1:x2` --- # Running regressions .exercise[ ### Exercise 6
Using the `census` data, run a regression of the number of divorces on population, urban population and number of marriages. ] .can-edit[ ```r lm(y ~ x1 + x2, data) ``` ]
−
+
01
:
00
--- # Running regressions .exercise[ ### Exercise 6
Using the `census` data, run a regression of the number of divorces on population, urban population and number of marriages. ] ```r reg1 <- lm( divorce ~ pop + popurban + marriage, census ) ``` --- # Running regressions .large[ - The output of regression commands is a list of relevant information. - By default, it prints only a small portion of this information. - The best way to visualize results is to store this list in an object and then access its contents using the function `summary` ] --- # Running regressions .small[ ```r reg1 <- lm( divorce ~ pop + popurban + marriage, census ) summary(reg1) ``` ``` ## ## Call: ## lm(formula = divorce ~ pop + popurban + marriage, data = census) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22892.3 -1665.1 796.5 4138.0 17212.2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.207e+02 1.838e+03 0.066 0.948 ## pop 1.044e-03 1.633e-03 0.639 0.526 ## popurban 1.954e-03 1.796e-03 1.088 0.282 ## marriage 2.587e-01 5.958e-02 4.342 7.7e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7466 on 46 degrees of freedom ## Multiple R-squared: 0.9169, Adjusted R-squared: 0.9115 ## F-statistic: 169.2 on 3 and 46 DF, p-value: < 2.2e-16 ``` ] --- # Running regressions The `feols` command from package `fixest` allows for more flexibility in model specification .command[ ## feols(formula, data, subset, weights, ...) - **formula:** an object of class "formula" containing a symbolic description of the model - **data:** a data frame containing the variables indicated in the formula - *vcov:* one of "iid", "hetero" (or "HC1"), "cluster", "twoway", "NW" (or "newey_west"), "DK" (or "driscoll_kraay"), or "conley" - *subset:* an optional vector specifying a subset of observations to be used in the regression - *weights:* an optional vector of weights to be used in the regression - *cluster:* a list of vectors, a character vector of variable names, a formula or an integer vector specifying how to cluster standard errors - ... ] --- # Running regressions Formulas for `feols` are more complex, and take the following format: `y ~ x1 + x2 | fe1 + fe2 | x3 ~ iv3` - `y ~ x1 + x2` takes all the same formulas as `lm` - `fe1 + fe2` list the variables to be included as fixed effects - `x3 ~ iv3` uses instrument `iv3` for variable `x3` --- # Running regressions .exercise[ ### Exercise 7
Using the `census` data, run a regression of the number of divorces on population, urban population and number of marriages controlling for region fixed effects. ] .can-edit[ ```r feols( y ~ x1 + x2 | fe1 + fe2, data ) ``` ]
−
+
01
:
00
--- # Running regressions .exercise[ ### Exercise 7
Using the `census` data, run a regression of divorce on population, urban population and number of marriages controlling for region fixed effects. ] ```r reg2 <- feols( divorce ~ pop + popurban + marriage | region, census, se = "iid" ) summary(reg2) ``` ??? Note that feols uses clusters standard errors by default. To avoid this behavior, se `se = "iid"` --- # Running regressions .small[ ```r reg2 <- feols( divorce ~ pop + popurban + marriage | region, census, se = "iid" ) summary(reg2) ``` ``` ## OLS estimation, Dep. Var.: divorce ## Observations: 50 ## Fixed-effects: region: 4 ## Standard-errors: IID ## Estimate Std. Error t value Pr(>|t|) ## pop 0.000395 0.001788 0.220981 0.8261531 ## popurban 0.003553 0.001998 1.778249 0.0824344 . ## marriage 0.183659 0.058027 3.165064 0.0028471 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 6,257.6 Adj. R2: 0.927695 ## Within R2: 0.935434 ``` ] --- # Some notes on regressions .large[ - Whenever a factor is included in the list of covariates, it is treated as a categorical variable, i.e., as if you had written `i.x` in Stata. - Whenever a boolean is included in the list of covariates, it is treated as a dummy variable, where `TRUE` is `1` and `FALSE` is `0`. ] --- class: inverse, center, middle name: reg_tables # Exporting regression tables <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Exporting regression tables `huxtable` also has a quick wrapper for regression tables .command[ ## huxreg(...) - **...**: Models, or a single list of models. Names will be used as column headings. - *number_format:* Format for numbering. See number_format() for details. - *stars:* Levels for p value stars. - *bold_signif:* Where p values are below this number, cells will be displayed in bold. - *note:* Footnote for bottom cell, which spans all columns. - *statistics:* A vector of summary statistics to display. - *coefs:* A vector of coefficients to display. To change display names, name the coef vector: c("Displayed title" = "coefficient_name", ...) ] --- # Exporting regression tables .small[ ```r huxreg(reg1, reg2) ```
(1)
(2)
(Intercept)
120.730
(1838.216)
pop
0.001
0.000
(0.002)
(0.002)
popurban
0.002
0.004
(0.002)
(0.002)
marriage
0.259 ***
0.184 **
(0.060)
(0.058)
N
50
50
R2
0.917
0.937
logLik
-514.766
-508.024
AIC
1039.531
1030.048
*** p < 0.001; ** p < 0.01; * p < 0.05.
] By now, you should know how to export this to Excel and LaTeX. --- # Formatting regression tables .pull-left[ ```r huxreg( reg1, reg2, # Show variable labels instead of names coefs = c( "Population" = "pop", "Urban population" = "popurban", "Number of marriages" = "marriage" ), statistics = c("N. obs." = "nobs")) %>% add_rows( c("Region FE", "No", "Yes"), after = 7 ) ``` ] .pull-right[
(1)
(2)
Population
0.001
0.000
(0.002)
(0.002)
Urban population
0.002
0.004
(0.002)
(0.002)
Number of marriages
0.259 ***
0.184 **
(0.060)
(0.058)
Region FE
No
Yes
N. obs.
50
50
*** p < 0.001; ** p < 0.01; * p < 0.05.
] --- # Exporting regression tables .exercise[ ### Exercise 8
Export a regression table with the results of your estimations using `lm` and `feols`: - Use `huxreg` to combine `reg1` and `reg2`. - Use `quick_xlsx` or `quick_latex` to export the output of `huxreg` to your preferred format. ]
−
+
01
:
00
-- .pull-left[ ```r # Combine regression results with huxreg reg_table <- huxreg(reg1, reg2) # Export to Excel quick_xlsx(reg_table, file = here( "DataWork", "Output", "Raw", "regression_table.xlsx")) ``` ] .pull-right[ ```r # Or, export to LaTeX quick_latex(reg_table, file = here( "DataWork", "Output", "Raw", "regression_table.tex")) ``` ] --- # References and recommendations * Econometrics with R https://www.econometrics-with-r.org/index.html * `modelsummary` documentation: https://vincentarelbundock.github.io/modelsummary/index.html * Introduction to `huxtable`: https://cran.r-project.org/web/packages/huxtable/vignettes/huxtable.html * Using `huxtable` for regression tables: https://cran.r-project.org/web/packages/huxtable/vignettes/huxreg.html * Sample code for tables in R: https://github.com/RRMaximiliano/r-latex-tables-sum-stats * More sample code for tables in R: https://evalsp20.classes.andrewheiss.com/reference/regtables/ * Johns Hopkins Exploratory Data Analysis at Coursera: https://www.coursera.org/learn/exploratory-data-analysis * Udacity's Data Analysis with R: https://www.udacity.com/course/data-analysis-with-r--ud651 ## Since we talked about LaTeX so much... * DIME LaTeX templates and trainings: https://github.com/worldbank/DIME-LaTeX-Templates * All you need to know about LaTeX: https://en.wikibooks.org/wiki/LaTeX --- class: inverse, center, middle # Thank you! --- class: inverse, center, middle # Appendix --- class: inverse, center, middle name: aggregating # Aggregating observations <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- # Aggregating observations .large[ - If you want to show aggregated statistics, the function `summarise` is a powerful tool. - It is similar to `datasummary` in that it calculates a series of statistics for a data frame. - However, it does not have pre-defined statistics, so it requires more manual input. - On the other hand, its output is a regular data frame, so it is also useful to create constructed data sets. - Its Stata equivalent would be `collapse` ] --- # Aggregating observations .command[ ## `summarise(.data, ...,)` - **data**: the data frame to be summarized - **...**: Name-value pairs of summary functions. The name will be the name of the variable in the result. ] -- The "name-value" pairs mentioned under `...` look like this: `new_variable = function(existing_variable)`, where possible functions include: - Center: `mean()`, `median()` - Spread: `sd()`, `IQR()`, `mad()` - Range: `min()`, `max()`, `quantile()` - Count: `n()`, `n_distinct()` --- # Aggregating observations .pull-left[ ```r region_stats <- census %>% group_by(region) %>% summarise( `Number of States` = n_distinct(state), `Total Population` = sum(pop) ) ``` ] .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> region </th> <th style="text-align:right;"> Number of States </th> <th style="text-align:right;"> Total Population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> NE </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 49135283 </td> </tr> <tr> <td style="text-align:left;"> N Cntrl </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 58865670 </td> </tr> <tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 74734029 </td> </tr> <tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 43172490 </td> </tr> </tbody> </table> ] --- # Aggregating observations .exercise[ ### Exercise 9
Recreate the `region_stats` data set, now including the average and the standard deviation of the population. ]
−
+
01
:
30
--- # Aggregating observations ```r region_stats <- census %>% group_by(region) %>% summarise( `Number of States` = n_distinct(state), `Total Population` = sum(pop), `Average Population` = mean(pop), `SD of Population` = sd(pop) ) ``` <table> <thead> <tr> <th style="text-align:left;"> region </th> <th style="text-align:right;"> Number of States </th> <th style="text-align:right;"> Total Population </th> <th style="text-align:right;"> Average Population </th> <th style="text-align:right;"> SD of Population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> NE </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 49135283 </td> <td style="text-align:right;"> 5459476 </td> <td style="text-align:right;"> 5925235 </td> </tr> <tr> <td style="text-align:left;"> N Cntrl </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 58865670 </td> <td style="text-align:right;"> 4905473 </td> <td style="text-align:right;"> 3750094 </td> </tr> <tr> <td style="text-align:left;"> South </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 74734029 </td> <td style="text-align:right;"> 4670877 </td> <td style="text-align:right;"> 3277853 </td> </tr> <tr> <td style="text-align:left;"> West </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 43172490 </td> <td style="text-align:right;"> 3320961 </td> <td style="text-align:right;"> 6217177 </td> </tr> </tbody> </table> --- # Aggregating observations .exercise[ ### Exercise 9
Use `huxtable` to format and export the object `region_stats`. ]
−
+
02
:
00
--- # Aggregating observations ```r region_stats_table <- region_stats %>% rename(Region = region) %>% as_hux %>% set_header_cols("Region", TRUE) %>% theme_bright() quick_xlsx( region_stats_table, file = here( "DataWork", "Output", "Raw", "region-stats.xlsx" ) ) ```