class: center, middle, inverse, title-slide

.title[
# Lecture 13
]
.subtitle[
## Introduction to Statistical Modeling
]
.author[
### Tyler Ransom
]
.date[
### ECON 5253, University of Oklahoma
]

---

# Plan for the day

- Overview of statistical modeling

- How to model outcomes of various varieties

- Briefly discuss the pros and cons of various modeling approaches

---

# Fundamentals of modeling

So, you have some data. Now you need to write down a model of how the variables in the data interact with each other

- Why do I need a model?

    - because you're interested in either predicting some outcome variable (like sales)

    - or because you're interested in understanding a causal relationship (e.g. between advertising and sales)

---

# The types of variables in a model

1. Outcome variable (typically denoted `\(y\)`), aka:
    - response variable; dependent variable; target variable
2. Covariates (typically denoted `\(x\)`), aka:
    - predictors; features; independent variables; control variables
3. Parameters (typically denoted `\(\beta\)` or `\(\theta\)`):
    - map covariates into outcomes
4. Error term (typically denoted `\(\varepsilon\)`), aka:
    - unobservables; disturbance term; noise

---

# What is a model?

- A .hi[model] is a mapping between covariates, parameters, unobservables, and outcomes

- It is the "production function" that generates the outcome we are interested in

- Most generally, the model is:

`\begin{align*}
y &= f(x,\theta,\varepsilon)
\end{align*}`

- The model can be as simple as

`\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\end{align*}`

or it can be as complex as

`\begin{align*}
y &= \beta_0 x_1^{(\beta_1)^{\beta_2}} x_2 e^{\varepsilon}
\end{align*}`

---

# Parametric vs. nonparametric models

- A model can even have a `\(\theta\)` that is .hi[infinite-dimensional]

- We call this a .hi[nonparametric] model

- (Equivalently, a nonparametric model is one in which we do not make any assumption about the distribution from which `\(\varepsilon\)` is drawn)

---

# The types of variables

Variables can be:

- continuous

- binary

- categorical (ordered)

- categorical (unordered)

- integers (i.e. counts)

Note that both dependent and independent variables can be of these types

---

# Modeling dependent variables

- For each type of dependent variable, we can come up with a statistical model to describe the relationship between the dependent variable and the covariates and unobservables

- All models fall under two umbrellas:

    1. Parametric

    2. Nonparametric

---

# Pros/Cons of parametric/nonparametric models

- .hi[Parametric models:]

    - Pro: one can interpret the model by looking at the parameters

    - Con: may not always be flexible enough to fit the data well

- .hi[Nonparametric models:]

    - Pro: usually fit the data better

    - Con: may not be readily interpretable

---

# Continuous dependent variables

- A variable is .hi[continuous] if it takes on any real number over some range (typically the entire real number line or the positive real numbers)

- .hi[Examples:] sales, earnings, number of page clicks, etc.

- The table below lists examples of parametric and nonparametric models appropriate for continuous dependent variables

| Parametric             | Nonparametric                                         |
|------------------------|-------------------------------------------------------|
| OLS                    | regression tree (forest, etc.)                        |
| Quantile regression    | support vector machine (including k-nearest neighbor) |
| naive Bayes regression | Artificial Neural Network (ANN)                       |
|                        | genetic programming (GP)                              |
|                        | ...                                                   |
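---

# Example: modeling a continuous outcome in R

A minimal sketch contrasting a parametric and a nonparametric fit from the table above, using the built-in `mtcars` data (the dataset and covariates are illustrative choices, not from the lecture):

```r
library(rpart)   # recursive partitioning: regression trees

# Parametric: OLS regression of fuel economy on weight and horsepower
ols.fit  <- lm(mpg ~ wt + hp, data = mtcars)

# Nonparametric: a regression tree with the same outcome and covariates
tree.fit <- rpart(mpg ~ wt + hp, data = mtcars)

coef(ols.fit)    # a few interpretable slope parameters
print(tree.fit)  # a set of data-driven split rules
```

The OLS fit is summarized by a handful of parameters; the tree is more flexible, but it is summarized by split rules rather than by interpretable coefficients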
---

# Binary dependent variables

- A variable is .hi[binary] if it only takes on two values (without loss of generality: 0 or 1)

- .hi[Examples:] product was purchased by customer (or not), individual has cancer (or not), loan is in default (or not)

| Parametric                 | Nonparametric                      |
|----------------------------|------------------------------------|
| Logistic regression        | classification tree (forest, etc.) |
| Probit regression          | support vector classifier          |
| naive Bayes classification | Artificial Neural Network (ANN)    |
|                            | genetic programming (GP)           |
|                            | ...                                |

---

# Linear Probability Model vs. Logit

<img src="13slides_files/figure-html/lpm-graph-1.png" width="52%" style="display: block; margin: auto;" />

---

# Ordered categorical (aka Ordinal) dep. variables

- A variable is .hi[ordered categorical] if it takes on a finite (typically small) set of values 0, 1, ..., K where .hi[order matters] (i.e. K > K-1 > K-2 > ... > 1 > 0)

- Values are mutually exclusive, so that each observation in the data belongs to one and only one category

- .hi[Example 1:] a consumer shows "interest" in a product by either: (0) not clicking on it; (1) clicking on it but not adding it to the cart; (2) adding it to the cart but not purchasing it; (3) purchasing it once; or (4) purchasing it multiple times

- .hi[Example 2:] a loan is either: (0) neither in default nor foreclosure; (1) in default but not in foreclosure; or (2) in foreclosure

---

# Models for ordinal dependent variables

| Parametric                  | Nonparametric                     |
|-----------------------------|-----------------------------------|
| Ordered logistic regression | ordinal trees                     |
| Ordered probit regression   | support vector ordinal regression |
|                             | ANN                               |
|                             | GP                                |
|                             | ...                               |

---

# Unordered categorical dependent variables

- A variable is .hi[unordered categorical] if it takes on a finite (typically small) set of values (without loss of generality: 0, 1, ..., K) but where *there is no inherent ordering among the categories*

- Values are mutually exclusive, so that each observation in the data belongs to one and only one category

- .hi[Example 1:] a consumer can purchase: (0) no handbags; (1) Louis Vuitton handbags; (2) Coach handbags; (3) other designer handbags; or (4) non-designer ("generic") handbags

- .hi[Example 2:] a person can choose to live in either: (0) the Oklahoma City metro area; (1) the Tulsa metro area; or (2) somewhere else in Oklahoma

---

# Models for unordered categorical dependent variables

- The algorithms here are exactly the same as for binary dependent variables, except that now there are multiple categories (a short example follows on the next slide)

| Parametric                      | Nonparametric                      |
|---------------------------------|------------------------------------|
| Multinomial logistic regression | classification tree (forest, etc.) |
| Multinomial probit regression   | support vector classifier          |
| naive Bayes classification      | Artificial Neural Network (ANN)    |
|                                 | genetic programming (GP)           |
|                                 | ...                                |
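---

# Example: unordered categorical outcome in R

A minimal sketch of multinomial logistic regression (one of the parametric models in the previous table), using the built-in `iris` data, whose three species form an unordered categorical outcome (the dataset and covariates are illustrative choices, not from the lecture):

```r
library(nnet)   # provides multinom() for multinomial logit

# Multinomial logit: P(species | sepal measurements)
mlogit.fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

summary(mlogit.fit)                        # one coefficient vector per non-base category
head(predict(mlogit.fit, type = "probs"))  # predicted category probabilities
```

The model estimates a separate coefficient vector for each category relative to a base category, which is why interpretation is trickier than in the binary logit case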
---

# Integer-valued dependent variables

- A variable is .hi[integer-valued] if it takes on values in the set of non-negative integers (0, 1, 2, ...)

- Sometimes data with this property is referred to as .hi[count data]

- .hi[Example 1:] a consumer can smoke 0, 1, or more cigarettes in a day

- .hi[Example 2:] a soccer team can score 0, 1, ..., 8 goals in a game

- .hi[Example 3:] a city can experience 0, 1, or more vehicle accidents in a day

---

# Integer-valued dependent variables

- Depending on the setting, some researchers will simply assume log-normality of the dependent variable in this case

- Again, a choice like this really depends on the exact application

- The parametric algorithms for modeling count data are a bit different from before, but the nonparametric algorithms are quite similar:

| Parametric                   | Nonparametric                   |
|------------------------------|---------------------------------|
| Poisson regression           | regression tree (forest, etc.)  |
| Negative binomial regression | support vector machine          |
| zero-inflated count models   | Artificial Neural Network (ANN) |
| zero-truncated count models  | genetic programming (GP)        |
|                              | ...                             |

---

# Independent variables

As mentioned previously, we can have all kinds of independent variables

Some helpful things to know about independent variables:

- don't treat an ordered categorical variable as a continuous variable

- use "one-hot" encoding of categorical variables (equivalently, `as.factor()` in R)

- if using a linear regression model, it is sometimes helpful to create polynomial functions of continuous covariates

- interacting two binary covariates (or a binary and a continuous covariate) can also be helpful

- how you specify your right-hand-side variables depends a lot on the goal of your model (i.e. prediction vs. causality)

---

# Observational vs. Experimental data

.hi[Observational data] is data that has been collected without any experimental manipulation

- e.g. Census household survey data, the Twitter stream, Dow Jones stock prices, etc.

.hi[Experimental data] comes from a setting that scientists have set up, in which units are randomly assigned to take on certain values of a variable of interest (the .hi[treatment variable])

.hi[Causal inference] (as opposed to prediction) is the idea that we can figure out whether `\(x\)` causes `\(y\)` by comparing units that were treated with those that were not

Causal inference requires knowing the "counterfactual": "What would have happened to units in the control group if they had been treated?"

---

# Estimating statistical models in R, Python, and Julia

Libraries and functions (`library::function`); a short worked example using the base-R `glm` calls appears at the end of the deck

| Algorithm | R | Python | Julia |
|-----------|---|--------|-------|
| OLS | `base::lm` | `statsmodels::OLS` | `GLM::glm` |
| trees | `rpart::rpart` | `sklearn::tree` | `DecisionTree::build_tree` |
| k-nearest neighbor | `caret::knn3` | `sklearn::KNeighborsClassifier` | `NearestNeighbors::knn` |
| SVM | `e1071::svm` | `sklearn::svm` | `LIBSVM::svmtrain` |
| naive Bayes | `e1071::naiveBayes` | `sklearn::GaussianNB` | `NaiveBayes::HybridNB` |
| ANN | `nnet::nnet` | `sklearn::neural_network` | `Knet::train` |
| genetic prog. | `rgp::geneticProgramming` | `gplearn::est_gp.fit` | `GeneticAlgorithms::runga` |
| logistic/probit regression | `base::glm` | `sklearn::linear_model.LogisticRegression` | `GLM::glm` |
| ordered logit/probit | `MASS::polr` | `mord::LogisticIT` | `LowRankModels::OrderedMultinomialLoss` |
| multinomial logit/probit | `nnet::multinom` | same as logistic | `ScikitLearn::LogisticRegression` |
| Poisson regression | `base::glm` | `statsmodels::GLM` | `GLM::glm` |

---

# Useful Links

- [Multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification)

- [One-hot encoding](https://en.wikipedia.org/wiki/One-hot)

- [Genetic programming example in R](https://www.kaggle.com/olegtereshkin/genetic-programming-r-example/code)
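---

# Example: logit and Poisson regression via `glm` in R

A minimal sketch of two of the `base::glm` calls from the estimation table; the data sets and formulas are illustrative choices, not from the lecture:

```r
# Binary outcome: logit model of transmission type on fuel economy
#   family = binomial(link = "logit") gives logistic regression;
#   swapping in link = "probit" gives probit regression
logit.fit <- glm(am ~ mpg, family = binomial(link = "logit"), data = mtcars)

# Count outcome: Poisson regression on a built-in count data set
#   (warpbreaks records the number of breaks per loom of yarn)
pois.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"),
                data = warpbreaks)

summary(logit.fit)
summary(pois.fit)
```

The same `glm` interface covers several of the parametric models above; only the `family` (and link function) changes with the type of dependent variable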