class: center, middle, inverse, title-slide

.title[
# Econometrics
]
.subtitle[
## Probability: 🎲, 🔮, 🪙
]
.author[
### Florian Oswald
]
.date[
### UniTo ESOMAS 2025-09-17
]

---
layout: true

<div class="my-footer"><img src="../img/logo/unito-shield.png" style="height: 60px;"/></div>

---

# Probability

1. Probabilities and RVs
2. PDFs and CDFs of discrete RVs
3. PDFs and CDFs of continuous RVs
4. Expected value and variance
5. Functions of RVs
6. Standardizing RVs
7. Joint and marginal distributions
8. Conditional distributions, independence, covariance, correlation
9. Law of iterated expectations (LIE)
10. Correlation and conditional mean
11. KC 2.3
12. Random sampling and distributions of resulting statistics

---

# Probabilities

A *probability* is a function that assigns a value in `\([0,1]\)` to a *set* of outcomes (an *event*).

Consider a fair 6-sided dice. 🎲

* Outcome space: `\(\Omega = \{1,2,3,4,5,6\}\)`
* Event: a subset of `\(\Omega\)`.

--

For example, the events A ("the outcome is even") and B ("the outcome is odd") are:

`$$\begin{align} A &= \{2,4,6\},\\ B &= \{1,3,5\}. \end{align}$$`

Having a *fair* dice means that

`$$\Pr(1) = \dots = \Pr(6) = \frac{1}{6}$$`

---
class: inverse

# Probabilities

Given a fair 6-sided dice 🎲, i.e. an outcome space `\(\Omega = \{1,2,3,4,5,6\}\)` and events

`$$\begin{align} A &= \{2,4,6\},\\ B &= \{1,3,5\}. \end{align}$$`

how does one determine the ***likelihood*** (or the **probability**) of event A occurring?

---

# Discrete Random Variables

A *Discrete Random Variable* is a function mapping outcomes to measurements. For example, we might call

* `\(X\)` the number we obtain from throwing the dice once.
* `\(X_1,X_2\)` the 2 numbers we obtain from throwing the dice two times.
* `\(D\)` an indicator (either `0` or `1`) for whether a randomly sampled person answers "yes" or "no" when asked whether they have children.

---
class: inverse

# Gambler's Ruin

The dealer tosses a fair coin. If it comes up tails (`T`) you win, if it's heads (`H`) you lose.
Suppose we've seen the following sequence of tosses so far:

```
outcome:   H   H   H   H   H   H   H   H   H   H    ?
toss num:  1.  2.  3.  4.  5.  6.  7.  8.  9.  10.  11.
```

You have lost 10 times in a row now. Does this increase the probability that you will win the next round?

--

Which sequence is more likely to occur?

1. `TTTTTTTTTT` (10 `T` in a row)
2. `HTHTTHTHHT` (random occurrence of `T` and `H`)

---

## Gambler's Ruin

The probability of hitting 11 `H` in a row is

$$ \Pr(\text{11 Heads in a Row}) = \frac{1}{2^{11}} \approx 0.00049 \approx 0.05\% $$

--

Ok, but I already had 10 `H`! What's the next toss likely going to be given that miserable history?

$$ \Pr(\text{toss 11 is Heads} \mid 10 \text{ Heads before}) $$

Let's calculate that probability on the board!

---
layout: false
class: title-slide-section-red, middle

# Probability Distributions

---
layout: true

<div class="my-footer"><img src="../img/logo/unito-shield.png" style="height: 60px;"/></div>

---

## PDF/PMF and CDF of discrete RVs

* PDF/PMF: Probability Density/Mass Function. Table listing each outcome and the associated probability of observing it.
* CDF: Cumulative Distribution Function. Probability that a given RV takes on a value *less than or equal to a certain value* (requires a notion of *ordering* - e.g. what about a dice with 6 different **colors**?)

### For our 6-sided fair dice

```
##   outcome pdf cdf
## 1       1 1/6 1/6
## 2       2 1/6 2/6
## 3       3 1/6 3/6
## 4       4 1/6 4/6
## 5       5 1/6 5/6
## 6       6 1/6 6/6
```

---

## PDF/PMF and CDF of discrete RVs

* PDF/PMF: Probability Density/Mass Function. Table listing each outcome and the associated probability of observing it.
* CDF: Cumulative Distribution Function. Probability that a given RV takes on a value *less than or equal to a certain value* (requires a notion of *ordering* - e.g. what about a dice with 6 different **colors**?)
### For our 6-sided fair dice

.pull-left[
<img src="chapter_probability_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_probability_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />
]

---

## PDF/PMF and CDF of discrete RVs

* PDF/PMF: Probability Density/Mass Function. Table listing each outcome and the associated probability of observing it.
* CDF: Cumulative Distribution Function. Probability that a given RV takes on a value *less than or equal to a certain value* (requires a notion of *ordering* - e.g. what about a dice with 6 different **colors**?)

### For our fair coin

```
##   outcome pdf cdf
## 1    H(0) 1/2 1/2
## 2    T(1) 1/2 2/2
```

---

## PDF/PMF and CDF of discrete RVs

* PDF/PMF: Probability Density/Mass Function. Table listing each outcome and the associated probability of observing it.
* CDF: Cumulative Distribution Function. Probability that a given RV takes on a value *less than or equal to a certain value* (requires a notion of *ordering* - e.g. what about a dice with 6 different **colors**?)

### For our fair coin

.pull-left[
<img src="chapter_probability_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
<img src="chapter_probability_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />
]

---

## Bernoulli Distribution

* A special discrete RV with 2 outcomes: `0` and `1`, where event `1` ("success") occurs with probability `\(p\)`.

For example:

* Tomorrow it will rain with probability `\(p\)` (it will *not* rain with probability `\(1-p\)`).

--

What kind of RV is flipping a *fair* coin? What about an *unfair* coin?

---

## PDFs and CDFs of continuous Variables

* Slightly more complicated because we need calculus and integration.
* The basic idea is the same!
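
A minimal sketch of that idea in `R`, assuming a standard normal RV (our choice for illustration only; the variable names below are hypothetical): the CDF at `q` is the area under the PDF to the left of `q`, which is where integration comes in.

``` r
# Assumption: X ~ N(0, 1), picked only as an example of a continuous RV.
q <- 1

# PDF evaluated at q (a density value, not a probability)
dens_at_q <- dnorm(q)

# CDF at q, computed by numerically integrating the PDF from -Inf to q
area <- integrate(dnorm, lower = -Inf, upper = q)$value

# Built-in CDF for comparison
cdf <- pnorm(q)

# Probability of an interval: Pr(a < X <= b) = F(b) - F(a)
a <- -1
b <- 1
p_interval <- pnorm(b) - pnorm(a)

# The numerical integral and the built-in CDF should agree closely
isTRUE(all.equal(area, cdf, tolerance = 1e-4))
```

For a continuous RV, `dnorm(q)` is not `\(\Pr(X = q)\)`; probabilities come from areas under the density, such as `p_interval`.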
---
layout: false
class: title-slide-section-red, middle

# Expected Value and Variance

---
layout: true

<div class="my-footer"><img src="../img/logo/unito-shield.png" style="height: 60px;"/></div>

---

## Expected Value and Variance

For a discrete RV `\(Y\)` with outcomes `\(y_1,\dots,y_n\)` and probabilities `\(p_1,\dots,p_n\)`, we write

`$$E(Y) = y_1 p_1 + \dots + y_n p_n$$`

For example for our dice example, where `\(X\)` is the result of throwing the dice:

`$$\begin{align} E(X) &= 1 \times \Pr(1) + 2 \times \Pr(2) + \dots + 6 \times \Pr(6) \\ &= \end{align}$$`

**Everybody calculate this result now!**

---

# Expected Value and Variance

For a discrete RV `\(Y\)` with outcomes `\(y_1,\dots,y_n\)` and probabilities `\(p_1,\dots,p_n\)`, we write

$$ E(Y) = y_1 p_1 + \dots + y_n p_n $$

For example for our dice example, where `\(X\)` is the result of throwing the dice:

`$$\begin{align} E(X) &= 1 \times \Pr(1) + 2 \times \Pr(2) + \dots + 6 \times \Pr(6) \\ &= 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + \dots + 6 \times \frac{1}{6} \\ &= 3.5 \end{align}$$`

* So: We need to know ***all*** weights and values (`\(\Pr\)` and `\(X\)`) in order to compute this quantity.

---

# `\(E(X)\)` is a Theoretical Concept. Mission Impossible?

* Imagine you get this message:

--

1. Your mission, should you choose to accept it, is to inspect this device: 🔮. Whenever you touch it, it displays a number in `\(\{1,\dots,10\}\)`. E.g. `4,8,4,9,1,5,10,9,1`...kind of random.
2. We must know the long-run average, i.e. E(🔮). Or something terrible will happen!
3. This message will destroy itself in 10,9,8,...💣

--

* You **don't know** the theoretical distribution of all the numbers in 🔮 (the `\(\Pr\)`'s). Time is running out ⏰

--

* What could you do? 🤔

---

<iframe src="https://giphy.com/embed/plVdDRfj5WV47sIAsh" width="960" height="538" style="" frameBorder="0" class="giphy-embed" allowFullScreen></iframe>

---

# `\(E(X)\)` is a Theoretical Concept. Mission Impossible?

* If we had a huge number of observations from 🔮 (a *sample*), we could come up with a *guess* of E(🔮), based on the data we got. Empirical, like.

``` r
set.seed(12345)    # to ensure reproducibility
Ps = runif(10)
Ps[c(2,3)] = 0
Ps = Ps / sum(Ps)  # generate random weights
x = sample(1:10, size = 10000, replace = TRUE, prob = Ps)
xbar = mean(x)
xbar
```

```
## [1] 6.3418
```

--

* `xbar` = `\(\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i\)` is called *Sample Mean*, *Arithmetic Average*, *Sample Average*
* `\(\bar{x}\)` is an ***estimator*** for `\(E(X)\)`, which we also denote by `\(\mu_X\)`.

---

# Central Tendency - Mean and Median

.pull-left[

`mean(x)`: the average of all values in `x`.

`$$E(X) = \mu_X = \frac{1}{N}\sum_{i=1}^N x_i = \bar{x}$$`

***The second equality is correct only if...???***

``` r
x <- c(1,2,2,2,2,100)
mean(x)
```

```
## [1] 18.16667
```

``` r
# exact `==` on doubles can fail because of floating-point rounding
mean(x) == sum(x) / length(x)
```

```
## [1] FALSE
```

]

--

.pull-right[

`median`: the value `\(x_j\)` below and above which 50% of the values in `x` lie. `\(m\)` is the median if

`$$\Pr(X \leq m) \geq 0.5 \text{ and } \Pr(X \geq m) \geq 0.5$$`

The median is robust against *outliers*.

``` r
median(x)
```

```
## [1] 2
```

]

---

## Quick Review

1. Expected value of a Bernoulli RV
2. Continuous RVs

---

## Variance

* A measure of *spread* of a distribution.
* The definition of *variance* is

`$$var(Y) = E[(Y-\mu_Y)^2]$$`

If `\(Y\)` is discrete with outcomes `\(y_1,\dots,y_n\)` and probabilities `\(p_1,\dots,p_n\)`,

`$$var(Y) = \sum_{i=1}^n (y_i-\mu_Y)^2 \times p_i.$$`

---

## Variance

.pull-left[

Consider two normal distributions with equal mean `0`:

]

--

.pull-right[

<img src="chapter_probability_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

Compute with:

``` r
var(x)
```

]

---

## Example of other Spread Measures

.pull-left[

``` r
# % Catholic in 47 French-speaking
# Swiss cantons in 1888
plot(swiss$Catholic, rep(1, nrow(swiss)), pch = 3, cex = 2,
     xlab = "% Catholic", yaxt = "n", ylab = "")
```

<img src="chapter_probability_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

]

.pull-right[

How do the values in column `Catholic` *vary*?

| Measure | `R` | Result |
|:---------:|:-------------------:|:---------------------:|
| Variance | `var(swiss$Catholic)` | 1739.29 |
| Standard Deviation | `sd(swiss$Catholic)` | 41.7 |
| IQR | `IQR(swiss$Catholic)` | 87.93 |
| Minimum | `min(swiss$Catholic)` | 2.15 |
| Maximum | `max(swiss$Catholic)` | 100 |
| Range | `range(swiss$Catholic)` | 2.15, 100 |

]

---

# Computing Variance by hand

1. Compute the variance of our 6-sided dice!
2. Compute the variance of `\(X \sim \text{Bernoulli}(p)\)`!

---
layout: false
class: title-slide-section-red, middle

# Two Random Variables <br> `\((X,Y)\)`, (🎲, 🎲), (🎲, 🪙), (🪙, 🪙)

---
layout: true

<div class="my-footer"><img src="../img/logo/unito-shield.png" style="height: 60px;"/></div>

---

# Two Random Variables: Example

.pull-left[
<img src="chapter_probability_files/figure-html/x-y-corr-1.svg" style="display: block; margin: auto;" />
]

.pull-right[

* Here, `\(X\)` and `\(Y\)` are ***jointly normally*** distributed.
* We would write

`$$(X, Y) \sim \mathcal{N}\biggl( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \, \Sigma \biggr)$$`

where `\(\Sigma\)` is the covariance *matrix*

`$$\Sigma = \begin{bmatrix} \sigma_X^2 & \rho \, \sigma_X \sigma_Y \\ \rho \, \sigma_X \sigma_Y & \sigma_Y^2 \end{bmatrix}$$`

]

Taking as example the data in this plot, the concepts *covariance* and *correlation* relate to the following type of question:

# missing

joint and marginal dists

8. conditional dists, independence, covariance, corr
9. LIE
10. correlation and conditional mean

create/get a table and illustrate

---

# Tabulating Data

`table(x)` is a useful function that counts the occurrences of each unique value in `x`:

``` r
table(gapminder$continent)
```

```
## 
##   Africa Americas     Asia   Europe  Oceania 
##     2907     2052     2679     2223      684
```

--

The same can be achieved using the `count` function (from `dplyr`):

``` r
gapminder %>% count(continent)
```

```
##   continent    n
## 1    Africa 2907
## 2  Americas 2052
## 3      Asia 2679
## 4    Europe 2223
## 5   Oceania  684
```

---

# Tabulating Data

Given two variables, `table` produces a contingency table:

``` r
gapminder_new <- gapminder %>%
  filter(year == 2015) %>%
  mutate(fertility_above_2 = (fertility > 2.1)) # dummy variable for fertility rate above replacement rate
```

.pull-left[

``` r
table(gapminder_new$fertility_above_2)
```

```
## 
## FALSE  TRUE 
##    80   104
```

]

--

.pull-right[

``` r
table(gapminder_new$fertility_above_2,gapminder_new$continent)
```

```
##        
##         Africa Americas Asia Europe Oceania
##   FALSE      2       15   20     39       4
##   TRUE      49       20   27      0       8
```

]

--

With `prop.table`, we can get proportions:

``` r
# proportions by row
prop.table(table(gapminder_new$fertility_above_2,gapminder_new$continent), margin = 1)
# proportions by column
prop.table(table(gapminder_new$fertility_above_2,gapminder_new$continent), margin = 2)
```

* ⚠️ To include `NA`s in a `table` or cross-table, pass `useNA = "always"` or `useNA = "ifany"`.

---

# Tabulating Data

Again the `count` function can get you there as well:

``` r
gapminder_new %>% count(continent, fertility_above_2)
```

```
##    continent fertility_above_2  n
## 1     Africa             FALSE  2
## 2     Africa              TRUE 49
## 3   Americas             FALSE 15
## 4   Americas              TRUE 20
## 5   Americas                NA  1
## 6       Asia             FALSE 20
## 7       Asia              TRUE 27
## 8     Europe             FALSE 39
## 9    Oceania             FALSE  4
## 10   Oceania              TRUE  8
```

Note that `count` will display `NA`s only if there are any.

``` r
gapminder_new %>% filter(is.na(fertility)) %>% select(country)
```

```
##     country
## 1 Greenland
```

---

# Joint Distributions

.pull-left[

``` r
abs_tab
```

```
##           
##            FALSE TRUE
##   Africa       2   49
##   Americas    15   20
##   Asia        20   27
##   Europe      39    0
##   Oceania      4    8
```

There are 184 countries.

]

.pull-right[

``` r
prop_tab
```

```
##           
##            FALSE TRUE
##   Africa    0.01 0.27
##   Americas  0.08 0.11
##   Asia      0.11 0.15
##   Europe    0.21 0.00
##   Oceania   0.02 0.04
```

What *fraction* of countries is in each bin?

* The *joint distribution*.
* `\(\Pr(X,Y)\)` where X = `continent` and Y = `fertility_above_2`

]

--

The *conditional distribution* is `\(\Pr(X=x|Y=y) = \frac{\Pr(X=x,Y=y)}{\Pr(Y=y)}\)`

---
class: inverse

# Computing Conditional Distributions

.pull-left[

``` r
prop_tab
```

```
##           
##            FALSE TRUE
##   Africa    0.01 0.27
##   Americas  0.08 0.11
##   Asia      0.11 0.15
##   Europe    0.21 0.00
##   Oceania   0.02 0.04
```

]

.pull-right[

1. Compute the distribution of countries across continents conditional on low fertility.
2. Compute the distribution of high/low fertility conditional on being in Asia.
3. Compute the marginal distribution of high/low fertility.

]

---

# How are x and y related?

## Covariance and Correlation

---

# Covariance

* The covariance is a measure of __joint variability__ of two variables.

`$$Cov(x,y) = \frac{1}{N} \sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})$$`

--

* The `cov` function computes the sample covariance (it divides by `\(N-1\)` rather than `\(N\)`):

``` r
cov(gapminder_new$fertility,gapminder_new$infant_mortality, use = "complete.obs")
```

```
## [1] 24.21146
```

--

* Difficult to interpret because its magnitude depends on the units and dispersion of both variables.

---

# Correlation

* The correlation is a measure of the strength of the __linear association__ between two variables.

`$$Cor(x,y) = \frac{Cov(x,y)}{\sqrt{Var(x)}\sqrt{Var(y)}}$$`

--

* The `cor` function computes the correlation:

``` r
cor(gapminder_new$fertility,gapminder_new$infant_mortality, use = "complete.obs")
```

```
## [1] 0.8286402
```

---

# Correlation

* **Correlation is always between -1 and 1!**

--

.footnote[
*Source: [mathsisfun](https://www.mathsisfun.com/data/correlation.html)*
]

---
class: inverse

# Task 4: Summarising data

1. Compute the mean of GDP in 2011 and assign to object `mean`. You should exclude missing values. (*Hint: read the help for `mean` to remove `NA`s*).

1. Compute the median of GDP in 2011 and assign to object `median`. Again, you should exclude missing values. Is it greater or smaller than the average?

1. Create a density plot of GDP in 2011 using `geom_density`. A density plot is a way of representing the distribution of a numeric variable. Add the following code to your plot to show the median and mean as vertical lines. What do you observe?
`geom_vline(xintercept = as.numeric(mean), colour = "red") +` <br> `geom_vline(xintercept = as.numeric(median), colour = "orange")`

1. Compute the correlation between fertility and infant mortality in 2015. To drop `NA`s in either variable set the argument `use` to "pairwise.complete.obs" in your `cor()` function. Is this correlation consistent with the graph you produced in Task 3?

---
class: title-slide-final, middle
background-image: url(../img/logo/esomas.png)
background-size: 250px
background-position: 9% 19%

# That's all for Probability!

| | |
| :--------------------------------------------------------------------------------------------------------- | :-------------------------------- |
| <a href="https://github.com/floswald/Econometrics-Slides">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | Slides |
| <a href="https://floswald.github.io">.ScPored[<i class="fa fa-link fa-fw"></i>]</a> | My Homepage |
| <a href="https://scpoecon.github.io/ScPoEconometrics/">.ScPored[<i class="fa fa-github fa-fw"></i>]</a> | Book |