This document narrates the analysis of Titanic survival data.
Packages that will be used in this analysis:
# Attach these packages so their functions don't need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path
library(magrittr) # enables piping : %>%
library(dplyr)
library(ggplot2)
library(titanic)
# Verify these packages are available on the machine, but their functions need to be qualified: http://r-pkgs.had.co.nz/namespace.html#search-path
requireNamespace("tidyr") # data manipulation
# requireNamespace("testit") # For asserting conditions meet expected patterns.
# requireNamespace("car") # For its `recode()` function.
./manipulation/0-greeter.R
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
Observations: 891
Variables: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ...
$ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,...
$ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 1, 3, 3, 1, 1,...
$ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Heikkinen, M...
$ Sex <chr> "male", "female", "female", "female", "male", "male", "male", "male", "female", "female", "fema...
$ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, NA, 31, NA, 35, 34, 15, 28, 8,...
$ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0, 0, 0, 0, 3, 1, 0, 3, 0, 0, 0, 1,...
$ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 5, 0, 2, 0, 0, 0, 0,...
$ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450", "330877", "17463", "349909", "...
$ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21.0750, 11.1333, 30.0708, 16.7000, ...
$ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C103", "", "", "", "", "", "", "", "",...
$ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "S", "S", "S", "S", "Q", "S", "S", ...
variable | type | na | na_pct | unique | min | mean | max |
---|---|---|---|---|---|---|---|
PassengerId | int | 0 | 0.0 | 891 | 1.00 | 446.00 | 891.00 |
Survived | int | 0 | 0.0 | 2 | 0.00 | 0.38 | 1.00 |
Pclass | int | 0 | 0.0 | 3 | 1.00 | 2.31 | 3.00 |
Name | chr | 0 | 0.0 | 891 | NA | NA | NA |
Sex | chr | 0 | 0.0 | 2 | NA | NA | NA |
Age | dbl | 177 | 19.9 | 89 | 0.42 | 29.70 | 80.00 |
SibSp | int | 0 | 0.0 | 7 | 0.00 | 0.52 | 8.00 |
Parch | int | 0 | 0.0 | 7 | 0.00 | 0.38 | 6.00 |
Ticket | chr | 0 | 0.0 | 681 | NA | NA | NA |
Fare | dbl | 0 | 0.0 | 248 | 0.00 | 32.20 | 512.33 |
Cabin | chr | 0 | 0.0 | 148 | NA | NA | NA |
Embarked | chr | 0 | 0.0 | 4 | NA | NA | NA |
To prepare our data for modeling, let's perform routine data transformations:
* 1. Convert column names to lowercase
* 2. Select and sort columns
* 3. Rename columns
* 4. Convert strings to factors
* 5. Filter out missing values
Observations: 712
Variables: 8
$ survived <fct> Died, Survived, Survived, Survived, Died, Died, Died, Survived, Survived, Survived, Surv...
$ pclass <fct> Third, First, Third, First, Third, First, Third, Third, Second, Third, First, Third, Thi...
$ sex <fct> Men, Women, Women, Women, Men, Men, Men, Women, Women, Women, Women, Men, Men, Women, Wo...
$ age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, 31, 35, 34, 15, 28, 8, 38, ...
$ n_siblings_spouses <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0, 3, 1, 3, 0, 0, 1, 1, 0, 2...
$ n_parents_children <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0, 1, 5, 2, 0, 0, 0, 0, 0, 0...
$ price_ticket <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 11.1333, 30.0708, 16.7000, 2...
$ port_embarked <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", "S", "S", "S", "S", "S", "Q", "S", "S"...
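The five transformation steps listed above might look roughly like the sketch below. It is not the report's actual script: it runs on the six preview rows shown earlier rather than the full data, the object names `ds_raw` and `ds_sketch` are made up for illustration, and the factor labels are inferred from the output above. It filters on missing `age` only; since 891 minus 177 missing ages leaves 714 rows, the 712 rows reported above suggest the original filter applies at least one further condition that is not shown here.

```r
library(magrittr) # provides %>%

# First six passengers, copied from the preview table in this report.
ds_raw <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0L, 1L, 1L, 1L, 0L, 0L),
  Pclass      = c(3L, 1L, 3L, 1L, 3L, 3L),
  Sex         = c("male", "female", "female", "female", "male", "male"),
  Age         = c(22, 38, 26, 35, 35, NA),
  SibSp       = c(1L, 1L, 0L, 1L, 0L, 0L),
  Parch       = c(0L, 0L, 0L, 0L, 0L, 0L),
  Fare        = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583),
  Embarked    = c("S", "C", "S", "S", "S", "Q"),
  stringsAsFactors = FALSE
)

ds_sketch <- ds_raw %>%
  # 1. Convert column names to lowercase
  dplyr::rename_with(tolower) %>%
  # 2. Select and sort columns
  dplyr::select(survived, pclass, sex, age, sibsp, parch, fare, embarked) %>%
  # 3. Rename columns (new names taken from the output above)
  dplyr::rename(
    n_siblings_spouses = sibsp,
    n_parents_children = parch,
    price_ticket       = fare,
    port_embarked      = embarked
  ) %>%
  # 4. Convert codes and strings to labelled factors (labels are assumptions)
  dplyr::mutate(
    survived = factor(survived, levels = c(0, 1), labels = c("Died", "Survived")),
    pclass   = factor(pclass, levels = c(1, 2, 3), labels = c("First", "Second", "Third")),
    sex      = factor(sex, levels = c("male", "female"), labels = c("Men", "Women"))
  ) %>%
  # 5. Filter out rows with a missing age
  dplyr::filter(!is.na(age))

dplyr::glimpse(ds_sketch)
```

Listing `Men` and `Died` as the first factor levels makes them the reference categories, which matches the `sexWomen` coefficient reported in the model output.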
Summary tables help us see observed differences broken down by levels of the predictors:
# Count passengers and average their age within each survival-by-sex cell
ds_modeling %>%
  dplyr::group_by(survived, sex) %>%
  dplyr::summarize(
    n_people  = dplyr::n()
    ,mean_age = mean(age, na.rm = TRUE)
  )
survived | sex | n_people | mean_age |
---|---|---|---|
Died | Men | 360 | 31.61806 |
Died | Women | 64 | 25.04688 |
Survived | Men | 93 | 27.27602 |
Survived | Women | 195 | 28.63077 |
Call:
stats::glm(formula = survived ~ sex, family = "binomial", data = ds_modeling)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6721 -0.6779 -0.6779 0.7534 1.7795
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3535 0.1163 -11.64 <2e-16 ***
sexWomen 2.4676 0.1852 13.33 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 960.90 on 711 degrees of freedom
Residual deviance: 749.57 on 710 degrees of freedom
AIC: 753.57
Number of Fisher Scoring iterations: 4
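Because the model has a single binary predictor, its coefficients can be recovered from the four cell counts in the survived-by-sex table. The sketch below rebuilds a 712-row data set from those counts, refits the same formula, and converts the log-odds into an odds ratio and survival probabilities. The object names (`cells`, `ds_cells`, `m_sex`) are illustrative and do not come from the original script.

```r
# Cell counts copied from the survived-by-sex summary table in this report.
cells <- data.frame(
  survived = factor(c("Died", "Died", "Survived", "Survived"),
                    levels = c("Died", "Survived")),
  sex      = factor(c("Men", "Women", "Men", "Women"),
                    levels = c("Men", "Women")),
  n_people = c(360L, 64L, 93L, 195L)
)

# Expand to one row per passenger so the fit mirrors the 712-row modeling data.
row_index <- rep(seq_len(nrow(cells)), times = cells$n_people)
ds_cells  <- cells[row_index, c("survived", "sex")]

# Same formula and family as the call shown in the model summary.
m_sex <- stats::glm(survived ~ sex, family = "binomial", data = ds_cells)

log_odds   <- stats::coef(m_sex)
odds_ratio <- exp(log_odds[["sexWomen"]])              # odds of survival, Women vs. Men
p_men      <- stats::plogis(log_odds[["(Intercept)"]]) # equals 93 / (93 + 360)
p_women    <- stats::plogis(sum(log_odds))             # equals 195 / (195 + 64)

round(c(odds_ratio = odds_ratio, p_men = p_men, p_women = p_women), 3)
```

Read this way, the intercept is the log-odds of survival for men (reference level), and `sexWomen` is the difference in log-odds between women and men.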