This document narrates the analysis of Titanic survival data.

Packages that will be used in this analysis:

# Attach these packages so their functions don't need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path
library(magrittr) # enables piping : %>%
library(dplyr)
library(ggplot2)
library(titanic)
# Verify these packages are available on the machine, but their functions need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path
requireNamespace("tidyr") # data manipulation
# requireNamespace("testit") # For asserting conditions meet expected patterns.
# requireNamespace("car") # For its `recode()` function.

Wrangling

Load

Import the data prepared by ./manipulation/0-greeter.R:
PassengerId  Survived  Pclass  Name                                                 Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                              male    22   1      0      A/5 21171         7.2500          S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                               female  26   0      0      STON/O2. 3101282  7.9250          S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                             male    35   0      0      373450            8.0500          S
6            0         3       Moran, Mr. James                                     male    NA   0      0      330877            8.4583          Q
Observations: 891
Variables: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ...
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,...
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 1, 3, 3, 1, 1,...
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Heikkinen, M...
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "fema...
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, NA, 31, NA, 35, 34, 15, 28, 8,...
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0, 0, 0, 0, 3, 1, 0, 3, 0, 0, 0, 1,...
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 5, 0, 2, 0, 0, 0, 0,...
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "...
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21.0750, 11.1333, 30.0708, 16.7000, ...
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103", "", "", "", "", "", "", "", "",...
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "S", "S", "S", "S", "Q", "S", "S", ...

Inspect

To better understand the data set, let us inspect the quantitative properties of each variable:
variable     type  na   na_pct  unique  min   mean    max
PassengerId  int   0    0.0     891     1.00  446.00  891.00
Survived     int   0    0.0     2       0.00  0.38    1.00
Pclass       int   0    0.0     3       1.00  2.31    3.00
Name         chr   0    0.0     891     NA    NA      NA
Sex          chr   0    0.0     2       NA    NA      NA
Age          dbl   177  19.9    89      0.42  29.70   80.00
SibSp        int   0    0.0     7       0.00  0.52    8.00
Parch        int   0    0.0     7       0.00  0.38    6.00
Ticket       chr   0    0.0     681     NA    NA      NA
Fare         dbl   0    0.0     248     0.00  32.20   512.33
Cabin        chr   0    0.0     148     NA    NA      NA
Embarked     chr   0    0.0     4       NA    NA      NA
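
The code behind this table is not shown. As a rough sketch only, a per-variable summary of this kind could be computed in base R as below. The helper `describe_variables()` and the data frame `ds_demo` are hypothetical names introduced for illustration. `ds_demo` holds three columns of the six passengers printed in the Load section, and the type labels come from `class()` rather than the abbreviated `int`/`chr`/`dbl` used above.

```r
# Hypothetical helper (not from the original analysis): one summary row per column.
describe_variables <- function(d) {
  rows <- lapply(names(d), function(v) {
    x          <- d[[v]]
    is_numeric <- is.numeric(x)
    data.frame(
      variable = v,
      type     = class(x)[1],                      # e.g. "integer", "character", "numeric"
      na       = sum(is.na(x)),                    # number of missing values
      na_pct   = round(100 * mean(is.na(x)), 1),   # percent missing
      unique   = length(unique(x)),                # distinct values; NA counts as one
      min      = if (is_numeric) min(x, na.rm = TRUE) else NA_real_,
      mean     = if (is_numeric) round(mean(x, na.rm = TRUE), 2) else NA_real_,
      max      = if (is_numeric) max(x, na.rm = TRUE) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Illustrative data: three columns of the six passengers shown in the Load section.
ds_demo <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  Sex      = c("male", "female", "female", "female", "male", "male"),
  Age      = c(22, 38, 26, 35, 35, NA),
  stringsAsFactors = FALSE
)

ds_summary <- describe_variables(ds_demo)
print(ds_summary)
```

Character columns get `NA` for min, mean, and max, which mirrors the layout of the table above.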

Tweak

To prepare our data for modeling, let us perform routine data transformations:
* 1. Convert column names to lowercase
* 2. Select and sort columns
* 3. Rename columns
* 4. Convert strings to factors
* 5. Filter out missing values

Observations: 712
Variables: 8
$ survived           <fct> Died, Survived, Survived, Survived, Died, Died, Died, Survived, Survived, Survived, Surv...
$ pclass             <fct> Third, First, Third, First, Third, First, Third, Third, Second, Third, First, Third, Thi...
$ sex                <fct> Men, Women, Women, Women, Men, Men, Men, Women, Women, Women, Women, Men, Men, Women, Wo...
$ age                <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, 31, 35, 34, 15, 28, 8, 38, ...
$ n_siblings_spouses <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0, 3, 1, 3, 0, 0, 1, 1, 0, 2...
$ n_parents_children <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 1, 5, 2, 0, 0, 0, 0, 0, 0...
$ price_ticket       <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 11.1333, 30.0708, 16.7000, 2...
$ port_embarked      <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", "S", "S", "S", "S", "S", "Q", "S", "S"...
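
The transformation code itself is not shown, so the following is only a sketch of how the five steps might look with dplyr. The input `ds_raw` and the result `ds_sketch` are hypothetical names, and `ds_raw` contains only the six passengers printed in the Load section. The factor labels and new column names are taken from the output above. The level order of `pclass` and the exact filter condition are assumptions. In particular, 891 rows minus the 177 with missing age leaves 714, so the reported 712 rows suggest the original filter removes two more rows on a criterion the text does not state.

```r
library(magrittr) # provides the %>% pipe

# Illustrative input: the six rows shown in the Load section (selected columns only).
ds_raw <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L, 3L),
  Sex      = c("male", "female", "female", "female", "male", "male"),
  Age      = c(22, 38, 26, 35, 35, NA),
  SibSp    = c(1L, 1L, 0L, 1L, 0L, 0L),
  Parch    = c(0L, 0L, 0L, 0L, 0L, 0L),
  Fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583),
  Embarked = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

ds_sketch <- ds_raw %>%
  # 1. Convert column names to lowercase
  dplyr::rename_with(tolower) %>%
  # 2-3. Select, sort, and rename columns (new_name = old_name)
  dplyr::select(
    survived,
    pclass,
    sex,
    age,
    n_siblings_spouses = sibsp,
    n_parents_children = parch,
    price_ticket       = fare,
    port_embarked      = embarked
  ) %>%
  # 4. Convert coded values to labeled factors (pclass level order is an assumption)
  dplyr::mutate(
    survived = factor(survived, levels = c(0, 1), labels = c("Died", "Survived")),
    pclass   = factor(pclass, levels = c(1, 2, 3), labels = c("First", "Second", "Third")),
    sex      = factor(sex, levels = c("male", "female"), labels = c("Men", "Women"))
  ) %>%
  # 5. Filter out rows with a missing age (assumed condition; see the note above)
  dplyr::filter(!is.na(age))

dplyr::glimpse(ds_sketch)
```

Listing `Men` and `Died` as the first factor levels makes them the reference categories, which is consistent with the `sexWomen` coefficient reported in the Modeling section.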

Tabulating

Summary tables help us see observed differences broken down by levels of the predictors.

ds_modeling %>%
  dplyr::group_by(survived, sex) %>%
  dplyr::summarize(
    n_people = dplyr::n(),
    mean_age = mean(age, na.rm = TRUE)
  )
survived  sex    n_people  mean_age
Died      Men    360       31.61806
Died      Women  64        25.04688
Survived  Men    93        27.27602
Survived  Women  195       28.63077
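
The counts are easier to compare as proportions within each sex. The sketch below does that arithmetic in base R, using the four counts copied from the summary table. The names `ds_counts`, `n_sex`, and `proportion` are introduced here for illustration and are not part of the original analysis.

```r
# Counts copied from the summary table above.
ds_counts <- data.frame(
  survived = c("Died", "Died", "Survived", "Survived"),
  sex      = c("Men", "Women", "Men", "Women"),
  n_people = c(360, 64, 93, 195),
  stringsAsFactors = FALSE
)

# Total number of people of each sex, repeated on every row of that sex.
ds_counts$n_sex <- ave(ds_counts$n_people, ds_counts$sex, FUN = sum)

# Share of each outcome within sex.
ds_counts$proportion <- round(ds_counts$n_people / ds_counts$n_sex, 3)

print(ds_counts)
```

Among the 712 passengers retained for modeling, about 75% of the women survived (195 of 259) compared with about 21% of the men (93 of 453).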

Modeling

0 - Sex

Summary (m0)


Call:
stats::glm(formula = survived ~ sex, family = "binomial", data = ds_modeling)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6721  -0.6779  -0.6779   0.7534   1.7795  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3535     0.1163  -11.64   <2e-16 ***
sexWomen      2.4676     0.1852   13.33   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960.90  on 711  degrees of freedom
Residual deviance: 749.57  on 710  degrees of freedom
AIC: 753.57

Number of Fisher Scoring iterations: 4
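
The coefficients are on the log-odds scale, so it can help to convert them to an odds ratio and to predicted probabilities. As a sketch, the block below rebuilds a passenger-level data set from the four counts in the Tabulating section (the names `counts`, `ds_rebuilt`, and `m0_sketch` are hypothetical stand-ins, since `ds_modeling` is not available here), refits the same formula, and transforms the estimates.

```r
# Rebuild one row per passenger from the counts in the Tabulating section
# (age is omitted because m0 uses sex only).
counts <- data.frame(
  survived = c("Died", "Died", "Survived", "Survived"),
  sex      = c("Men", "Women", "Men", "Women"),
  n        = c(360L, 64L, 93L, 195L),
  stringsAsFactors = FALSE
)
ds_rebuilt <- counts[rep(seq_len(nrow(counts)), times = counts$n), c("survived", "sex")]
ds_rebuilt$survived <- factor(ds_rebuilt$survived, levels = c("Died", "Survived"))
ds_rebuilt$sex      <- factor(ds_rebuilt$sex, levels = c("Men", "Women"))

# Same formula and family as the reported call.
m0_sketch <- stats::glm(survived ~ sex, family = "binomial", data = ds_rebuilt)
b <- stats::coef(m0_sketch)

# Odds of survival for women relative to men.
odds_ratio <- exp(b[["sexWomen"]])
# Predicted probability of survival for each sex (inverse logit of the linear predictor).
p_men   <- stats::plogis(b[["(Intercept)"]])
p_women <- stats::plogis(b[["(Intercept)"]] + b[["sexWomen"]])

round(c(odds_ratio = odds_ratio, p_men = p_men, p_women = p_women), 3)
```

The refit reproduces the reported estimates (-1.3535 and 2.4676). The odds of survival for women are roughly 11.8 times those for men, and the predicted probabilities (about 0.205 for men and 0.753 for women) equal the observed proportions 93/453 and 195/259, as expected for a model with a single binary predictor.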

Graph (g0)