The Generalised Linear Model (1)

PSM 2

Bennett Kleinberg

15 Jan 2019

Welcome

Probability, Statistics & Modeling II

Lecture 2

What questions do you have?

Today

  • Modelling data
  • Regression in general
  • Linear regression
    • simple
    • multiple
  • Effects in regression analysis
  • Why the GLM?

Modelling data

Overall aim: make inferences from a sample to the population.

  • make assumptions about the data-generating process
  • a model describes the data in terms of variables

Modelling data

  • Predictions
  • Relationships (extracting information)

Case for today

Dataset 1: Terror data (“Trial and Terror dataset”)

load('./data/terror_data.RData')

names(terror_data)
## [1] "firstName"      "lastName"       "gender"         "case_informant"
## [5] "case_sting"     "sentence"

head(terror_data)
##           firstName    lastName gender case_informant case_sting sentence
## 3           Mubarak       Hamed   male          false      false       57
## 11            Tarek       Makki   male          false      false       26
## 20      Jalal Sadat    Moheisen   male           true       true       69
## 21 Thirunavukkarasu Varatharasa   male           true       true       57
## 22         Reinhard       Rusli   male           true       true       13
## 23    Syed Mustajab        Shah   male           true       true      225

dim(terror_data)
## [1] 471   6

Case for today

Dataset 2: Mass Shootings in detail (Stanford Mass Shootings in America dataset)

load('./data/mass_shootings_detailed.RData')

names(smsd)
##  [1] "caseid"         "n_fatal"        "n_injured"      "date"          
##  [5] "day"            "age"            "gender"         "n_guns"        
##  [9] "school_related" "mental_illness"

head(smsd)
##   caseid n_fatal n_injured       date      day age gender n_guns
## 1      1      16        32   8/1/1966   Monday  20   Male      8
## 2      2       5         1 11/12/1966 Saturday  11   Male      1
## 3      3       9        13   12/31/72   Sunday  17   Male      3
## 4      4       1         3    1/17/74 Thursday   3   Male      3
## 5      5       3         7   12/30/74   Monday   8   Male      3
## 7      7       7         2    7/12/76   Monday  34   Male      1
##   school_related mental_illness
## 1            Yes            Yes
## 2            Yes            Yes
## 3             No            Yes
## 4            Yes            Yes
## 5            Yes             No
## 7            Yes            Yes

dim(smsd)
## [1] 182  10

Core idea of regression

  • Model a relationship between an outcome variable and predictor variable(s)
  • Find relationships in data
  • Make predictions for new data

Core idea of regression

Aim: find a line that best summarises the data

Why linear?

  • Simplest-model principle
  • Many relationships approximate linearity
  • Non-linear relationships are often linear after transformation
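
The last point can be illustrated with a minimal sketch. The data below are hypothetical (not from the lecture datasets): y grows exponentially with x, so the raw relationship is curved, but the logarithm of y is a straight line in x.

```r
# Hypothetical data: y grows exponentially in x (no noise, for clarity)
x <- 1:10
y <- 2 * exp(0.5 * x)

# After taking logs, log(y) = log(2) + 0.5 * x is a straight line in x
fit_log <- lm(log(y) ~ x)
coef(fit_log)  # intercept close to log(2), slope close to 0.5
```

With real, noisy data the recovered coefficients would only approximate the underlying values.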

Regression formalised

Y = a + b*X + E

Regression formalised

  • The dependent variable Y
  • The predictor variable X
  • The intercept a (= the value of Y if X is 0)
  • The slope b (= the change in Y for every unit change in X)
  • The error term E (= the observed value minus the predicted value)
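
To make these components concrete, here is a small simulation sketch. The values chosen for a, b and the error spread are assumptions made purely for illustration. Data are generated from Y = a + b*X + E, and lm() is then used to recover the intercept and slope.

```r
set.seed(1)                      # for reproducibility
n <- 200
a <- 3; b <- 1.5                 # hypothetical "true" intercept and slope
X <- runif(n, 0, 10)             # predictor
E <- rnorm(n, mean = 0, sd = 2)  # error term
Y <- a + b * X + E               # outcome generated by the model

fit <- lm(Y ~ X)                 # estimate a and b from the data
coef(fit)                        # estimates should be near 3 and 1.5

# Residuals: observed value minus predicted (fitted) value
res <- Y - fitted(fit)

# Prediction for a new observation with X = 4
predict(fit, newdata = data.frame(X = 4))
```

The residuals computed by hand here are the same quantities that resid(fit) returns.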

Regression formalised

Y = a + b*X + E

Note: linear relationship

Regression assumptions

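  1. Linear relationship between the predictor(s) and the outcome
  2. Little or no multicollinearity among the predictors
  3. Residuals i.i.d. (independent and identically distributed)
    • E ~ i.i.d. N(0, sd): normal with mean 0 and constant standard deviation sd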
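
A rough sketch of how some of these assumptions might be inspected in R. The data are simulated and the variable names (x1, x2, y) are hypothetical; this is one possible set of checks, not a complete diagnostic procedure.

```r
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)                     # generated independently of x1
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n, mean = 0, sd = 1)

fit <- lm(y ~ x1 + x2)

# Multicollinearity: the correlation between predictors should be small
cor(x1, x2)

# Residuals: the mean is zero by construction when the model has an
# intercept; their spread roughly estimates the error sd
mean(resid(fit))
sd(resid(fit))

# Shapiro-Wilk test of normality applied to the residuals
shapiro.test(resid(fit))
```

Plotting the residuals against the fitted values is a common complementary check for the linearity assumption.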

Your shooter model

Modelling the no. of fatalities

victims = intercept + slope*number_of_guns

  • more guns -> more victims?
  • baseline victims -> 3
  • slope -> 1.5 additional victims per gun

pred.victims = 3 + 1.5*smsd$n_guns

Your shooter model

head(smsd, 1)
##   caseid n_fatal n_injured     date    day age gender n_guns
## 1      1      16        32 8/1/1966 Monday  20   Male      8
##   school_related mental_illness
## 1            Yes            Yes
# predicted victims for case 1 (n_guns = 8)
case_1 = 3 + 1.5*8
case_1
## [1] 15
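
The prediction for case 1 can be compared with the observed number of fatalities in the same row (n_fatal = 16). A minimal sketch of that residual, using only the values printed above:

```r
observed  <- 16             # n_fatal for case 1, from head(smsd, 1)
predicted <- 3 + 1.5 * 8    # intercept 3, slope 1.5, n_guns = 8
observed - predicted        # residual for case 1
## [1] 1
```

For this one case the hand-picked model underestimates the number of fatalities by one.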

Your shooter model

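# observed fatalities (x-axis) against the model's predictions (y-axis)
plot(smsd$n_fatal, pred.victims, ylim=c(0,30),
     xlab='Observed fatalities', ylab='Predicted victims')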
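
The same comparison can be summarised numerically. The sketch below uses only the six cases printed by head(smsd), so it is an illustration rather than an evaluation on the full 182-row dataset. The mean absolute error is one possible summary, chosen here as an assumption.

```r
# The six cases shown by head(smsd); the full dataset has 182 rows
n_fatal <- c(16, 5, 9, 1, 3, 7)
n_guns  <- c(8, 1, 3, 3, 3, 1)

pred <- 3 + 1.5 * n_guns      # hand-picked intercept and slope
res  <- n_fatal - pred        # observed minus predicted
res
mean(abs(res))                # mean absolute error of the hand-picked model

# For comparison: least-squares estimates on these six rows only
coef(lm(n_fatal ~ n_guns))
```

Estimating the intercept and slope from the data with lm(), rather than guessing them, is what linear regression does.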