class: center, middle, inverse, title-slide

.title[
# Lecture 13
]
.subtitle[
## Introduction to Statistical Modeling
]
.author[
### Tyler Ransom
]
.date[
### ECON 5253, University of Oklahoma
]

---

# Plan for the day

- Overview of statistical modeling

- How to model outcomes of various varieties

- Briefly discuss the pros and cons of various modeling approaches

---

# Fundamentals of modeling

So, you have some data. Now you need to write down a model of how the variables in the data interact with each other

- Why do I need a model?

    - because you're interested in either predicting some outcome variable (like sales)

    - or because you're interested in understanding a causal relationship (e.g. between advertising and sales)

---

# The types of variables in a model

1. Outcome variable (typically denoted `\(y\)`), aka:
    - response variable; dependent variable; target variable
2. Covariates (typically denoted `\(x\)`), aka:
    - predictors; features; independent variables; control variables
3. Parameters (typically denoted `\(\beta\)` or `\(\theta\)`):
    - map covariates into outcomes
4. Error term (typically denoted `\(\varepsilon\)`), aka:
    - unobservables; disturbance term; noise

---

# What is a model?

- A .hi[model] is a mapping between covariates, parameters, unobservables, and outcomes

- It is the "production function" that generates the outcome we are interested in

- Most generally, the model is:

`\begin{align*}
y &= f(x,\theta,\varepsilon)
\end{align*}`

- The model can be as simple as

`\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon
\end{align*}`

or it can be as complex as

`\begin{align*}
y &= \beta_0 x_1^{(\beta_1)^{\beta_2}} x_2 e^{\varepsilon}
\end{align*}`

---

# Parametric vs. nonparametric models

- A model can even have a `\(\theta\)` that is .hi[infinite-dimensional]

- We call this a .hi[nonparametric] model

- (Equivalently, a nonparametric model is one in which we do not make any assumption about the distribution from which `\(\varepsilon\)` is drawn)

---

# The types of variables

Variables can be:

- continuous

- binary

- categorical (ordered)

- categorical (unordered)

- integers (i.e. counts)

Note that both dependent and independent variables can be of these types

---

# Modeling dependent variables

- For each type of dependent variable, we can come up with a statistical model to describe the relationship between the dependent variable and the covariates and unobservables

- All models fall under two umbrellas:

    1. Parametric

    2. Nonparametric

---

# Pros/Cons of parametric/nonparametric models

- .hi[Parametric models:]

    - Pro: one can interpret the model by looking at the parameters

    - Con: may not always be flexible enough to fit the data well

- .hi[Nonparametric models:]

    - Pro: usually fit the data better

    - Con: may not be readily interpretable

---

# Continuous dependent variables

- A variable is .hi[continuous] if it takes on any real number over some range (typically the entire real number line or the positive real numbers)

- .hi[Examples:] sales, earnings, number of page clicks, etc.

- The table below lists examples of parametric and nonparametric models appropriate for continuous dependent variables

| Parametric             | Nonparametric                                         |
|------------------------|-------------------------------------------------------|
| OLS                    | regression tree (forest, etc.)                        |
| Quantile regression    | support vector machine (including k-nearest neighbor) |
| naive Bayes regression | Artificial Neural Network (ANN)                       |
|                        | genetic programming (GP)                              |
|                        | ...                                                   |
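---

# Example: modeling a continuous outcome in R

A minimal sketch contrasting a parametric and a nonparametric fit from the table above, using the built-in `mtcars` data (the dataset and covariates are illustrative choices, not from the lecture):

```r
library(rpart)   # recursive partitioning: regression trees

# Parametric: OLS regression of fuel economy on weight and horsepower
ols.fit  <- lm(mpg ~ wt + hp, data = mtcars)

# Nonparametric: a regression tree with the same outcome and covariates
tree.fit <- rpart(mpg ~ wt + hp, data = mtcars)

coef(ols.fit)    # a few interpretable slope parameters
print(tree.fit)  # a set of data-driven split rules
```

The OLS fit is summarized by a handful of parameters; the tree is more flexible, but it is summarized by split rules rather than by interpretable coefficients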
---

# Binary dependent variables

- A variable is .hi[binary] if it only takes on two values (without loss of generality: 0 or 1)

- .hi[Examples:] product was purchased by customer (or not), individual has cancer (or not), loan is in default (or not)

| Parametric                 | Nonparametric                      |
|----------------------------|------------------------------------|
| Logistic regression        | classification tree (forest, etc.) |
| Probit regression          | support vector classifier          |
| naive Bayes classification | Artificial Neural Network (ANN)    |
|                            | genetic programming (GP)           |
|                            | ...                                |

---

# Linear Probability Model vs. Logit

<img src="13slides_files/figure-html/lpm-graph-1.png" width="52%" style="display: block; margin: auto;" />

---

# Ordered categorical (aka Ordinal) dep. variables

- A variable is .hi[ordered categorical] if it takes on a finite (typically small) set of values 0, 1, ..., K where .hi[order matters] (i.e. K > K-1 > K-2 > ... > 1 > 0)

- Values are mutually exclusive, so that each observation in the data belongs to one and only one category

- .hi[Example 1:] a consumer shows "interest" in a product by either: (0) not clicking on it; (1) clicking on it but not adding it to the cart; (2) adding it to the cart but not purchasing it; (3) purchasing it once; or (4) purchasing it multiple times

- .hi[Example 2:] a loan is either: (0) neither in default nor foreclosure; (1) in default but not in foreclosure; or (2) in foreclosure

---

# Models for ordinal dependent variables

| Parametric                  | Nonparametric                     |
|-----------------------------|-----------------------------------|
| Ordered logistic regression | ordinal trees                     |
| Ordered probit regression   | support vector ordinal regression |
|                             | ANN                               |
|                             | GP                                |
|                             | ...                               |

---

# Unordered categorical dependent variables

- A variable is .hi[unordered categorical] if it takes on a finite (typically small) set of values (without loss of generality: 0, 1, ..., K) but where *there is no inherent ordering among the categories*

- Values are mutually exclusive, so that each observation in the data belongs to one and only one category

- .hi[Example 1:] a consumer can purchase: (0) no handbags; (1) Louis Vuitton handbags; (2) Coach handbags; (3) other designer handbags; or (4) non-designer ("generic") handbags

- .hi[Example 2:] a person can choose to live in either: (0) the Oklahoma City metro area; (1) the Tulsa metro area; or (2) somewhere else in Oklahoma

---

# Models for unordered categorical dependent variables

- The algorithms here are exactly the same as for binary dependent variables, except that now there are multiple categories (a short example follows on the next slide)

| Parametric                      | Nonparametric                      |
|---------------------------------|------------------------------------|
| Multinomial logistic regression | classification tree (forest, etc.) |
| Multinomial probit regression   | support vector classifier          |
| naive Bayes classification      | Artificial Neural Network (ANN)    |
|                                 | genetic programming (GP)           |
|                                 | ...                                |
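---

# Example: unordered categorical outcome in R

A minimal sketch of multinomial logistic regression (one of the parametric models in the previous table), using the built-in `iris` data, whose three species form an unordered categorical outcome (the dataset and covariates are illustrative choices, not from the lecture):

```r
library(nnet)   # provides multinom() for multinomial logit

# Multinomial logit: P(species | sepal measurements)
mlogit.fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)

summary(mlogit.fit)                        # one coefficient vector per non-base category
head(predict(mlogit.fit, type = "probs"))  # predicted category probabilities
```

The model estimates a separate coefficient vector for each category relative to a base category, which is why interpretation is trickier than in the binary logit case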
---

# Integer-valued dependent variables

- A variable is .hi[integer-valued] if it takes on values in the set of non-negative integers (0, 1, 2, ...)

- Sometimes data with this property is referred to as .hi[count data]

- .hi[Example 1:] a consumer can smoke 0, 1, or more cigarettes in a day

- .hi[Example 2:] a soccer team can score 0, 1, ..., 8 goals in a game

- .hi[Example 3:] a city can experience 0, 1, or more vehicle accidents in a day

---

# Integer-valued dependent variables

- Depending on the setting, some researchers will simply assume log-normality of the dependent variable in this case

- Again, a choice like this really depends on the exact application

- The parametric algorithms for modeling count data are a bit different from before, but the nonparametric algorithms are quite similar:

| Parametric                   | Nonparametric                   |
|------------------------------|---------------------------------|
| Poisson regression           | regression tree (forest, etc.)  |
| Negative binomial regression | support vector machine          |
| zero-inflated count models   | Artificial Neural Network (ANN) |
| zero-truncated count models  | genetic programming (GP)        |
|                              | ...                             |

---

# Independent variables

As mentioned previously, we can have all kinds of independent variables

Some helpful things to know about independent variables:

- don't treat an ordered categorical variable as a continuous variable

- use "one-hot" encoding of categorical variables (equivalently, `as.factor()` in R)

- if using a linear regression model, it is sometimes helpful to create polynomial functions of continuous covariates

- interacting two binary covariates (or a binary and a continuous covariate) can also be helpful

- how you specify your right-hand-side variables depends a lot on the goal of your model (i.e. prediction vs. causality)

---

# Observational vs. Experimental data

.hi[Observational data] is data that has been collected without any experimental manipulation

- e.g. Census household survey data, the Twitter stream, Dow Jones stock prices, etc.

.hi[Experimental data] comes from a setting that scientists have set up, in which units are randomly assigned to take on certain values of a variable of interest (the .hi[treatment variable])

.hi[Causal inference] (as opposed to prediction) is the idea that we can figure out whether `\(x\)` causes `\(y\)` by comparing units that were treated with those that were not

Causal inference requires knowing the "counterfactual": "What would have happened to units in the control group if they had been treated?"

---

# Estimating statistical models in R, Python, and Julia

Libraries and functions (`library::function`); a short worked example using the base-R `glm` calls appears at the end of the deck

| Algorithm | R | Python | Julia |
|-----------|---|--------|-------|
| OLS | `base::lm` | `statsmodels::OLS` | `GLM::glm` |
| trees | `rpart::rpart` | `sklearn::tree` | `DecisionTree::build_tree` |
| k-nearest neighbor | `caret::knn3` | `sklearn::KNeighborsClassifier` | `NearestNeighbors::knn` |
| SVM | `e1071::svm` | `sklearn::svm` | `LIBSVM::svmtrain` |
| naive Bayes | `e1071::naiveBayes` | `sklearn::GaussianNB` | `NaiveBayes::HybridNB` |
| ANN | `nnet::nnet` | `sklearn::neural_network` | `Knet::train` |
| genetic prog. | `rgp::geneticProgramming` | `gplearn::est_gp.fit` | `GeneticAlgorithms::runga` |
| logistic/probit regression | `base::glm` | `sklearn::linear_model.LogisticRegression` | `GLM::glm` |
| ordered logit/probit | `MASS::polr` | `mord::LogisticIT` | `LowRankModels::OrderedMultinomialLoss` |
| multinomial logit/probit | `nnet::multinom` | same as logistic | `ScikitLearn::LogisticRegression` |
| Poisson regression | `base::glm` | `statsmodels::GLM` | `GLM::glm` |

---

# Useful Links

- [Multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification)

- [One-hot encoding](https://en.wikipedia.org/wiki/One-hot)

- [Genetic programming example in R](https://www.kaggle.com/olegtereshkin/genetic-programming-r-example/code)
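---

# Example: logit and Poisson regression via `glm` in R

A minimal sketch of two of the `base::glm` calls from the estimation table; the data sets and formulas are illustrative choices, not from the lecture:

```r
# Binary outcome: logit model of transmission type on fuel economy
#   family = binomial(link = "logit") gives logistic regression;
#   swapping in link = "probit" gives probit regression
logit.fit <- glm(am ~ mpg, family = binomial(link = "logit"), data = mtcars)

# Count outcome: Poisson regression on a built-in count data set
#   (warpbreaks records the number of breaks per loom of yarn)
pois.fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"),
                data = warpbreaks)

summary(logit.fit)
summary(pois.fit)
```

The same `glm` interface covers several of the parametric models above; only the `family` (and link function) changes with the type of dependent variable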