Intro to {tidymodels}

class: center, middle, inverse, title-slide

# Intro to {tidymodels}
## RLadies Santiago, Chile
### Sara Acevedo
### March 2020

---

## Important notes before we start

Please review the **code of conduct**. This is a safe space and harassment is not tolerated: https://github.com/rladies/starter-kit/wiki/Code-of-Conduct#spanish

Presentations built with Xaringan:
* https://github.com/semiramisCJ/taller_xaringan_RLadiesMty2020
* https://github.com/sporella/xaringan_github

The code will be available on GitHub: https://github.com/Saryace

Find me online: Twitter @saryace, lab Instagram @soilbiophysics, lab Twitter @soilbiophysics1

---
## Plan for this session

.pull-left[ 
What we will cover today
* The packages and what each one is for
* The most important functions
* Fitting a linear model
* Basic visualization
]
.pull-right[ 
]

---
## If something in the workshop is unclear

.pull-left[ 
* That is normal, we may move a bit fast
* The code and the slides will remain available
* There will be time for questions
]
.pull-right[ 
<img src="https://1.bp.blogspot.com/-WdKQMDR7ijE/VlIm8xqINqI/AAAAAAAA404/1wWcyHthAkQ/s1600/pantera-2.gif">
]

---
class: inverse, center, middle
# Let's get started

---

## Tidymodels packages

* Tidyverse syntax
* Reproducible data workflows
* Developed by Max Kuhn, author of {caret}
* Today we will use rsample, parsnip, recipes, and yardstick
* Others: corrr, dials, workflows, tune

<img src="https://rviews.rstudio.com/post/2019-06-14-a-gentle-intro-to-tidymodels_files/figure-html/tidymodels.png">
Figure: https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
---

- Install the **tidyverse** and **tidymodels** packages, along with their dependencies

```r
install.packages(c("tidyverse", "tidymodels"), dependencies = TRUE)
```

- Install the **remotes** and **ggsignif** packages, along with their dependencies

```r
install.packages(c("remotes", "ggsignif"), dependencies = TRUE)
```

- Install the **datos** and **corrr** packages from GitHub

```r
remotes::install_github("cienciadedatos/datos")
remotes::install_github("tidymodels/corrr")
```

---
class: inverse, center, middle
# Dataset: pinguinos (penguins)

<img src="https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png">
Artwork by [@allison_horst](https://github.com/allisonhorst/)
---
# library() and glimpse()

```r
# packages
library(tidyverse)
library(tidymodels)
library(remotes)
library(datos)
library(ggsignif)
library(corrr)
# ggplot theme
theme_set(theme_bw())
# load the dataset
pinguinos <- datos::pinguinos
# take a quick look
dplyr::glimpse(pinguinos)
```

```
## Rows: 344
## Columns: 8
## $ especie         <fct> Adelia, Adelia, Adelia, Adelia, Adelia, Adelia, Adelia…
## $ isla            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen,…
## $ largo_pico_mm   <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42…
## $ alto_pico_mm    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20…
## $ largo_aleta_mm  <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, …
## $ masa_corporal_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 42…
## $ sexo            <fct> macho, hembra, hembra, NA, hembra, macho, hembra, mach…
## $ anio            <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
```

---
## A little cleanup

```r
# to keep things simple, we drop
pinguinos_db <-  pinguinos %>% 
                 drop_na() %>% # the observations with missing values
                 select(-anio) # the anio (year) column
# check the new data frame
glimpse(pinguinos_db)
```

```
## Rows: 333
## Columns: 7
## $ especie         <fct> Adelia, Adelia, Adelia, Adelia, Adelia, Adelia, Adelia…
## $ isla            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen,…
## $ largo_pico_mm   <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, …
## $ alto_pico_mm    <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2, …
## $ largo_aleta_mm  <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 185,…
## $ masa_corporal_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800, …
## $ sexo            <fct> macho, hembra, hembra, hembra, macho, hembra, macho, h…
```
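
For readers curious about what this cleanup amounts to, here is a minimal base R sketch of the same two operations. It is only an illustration: the built-in `airquality` data stands in for `pinguinos`, and its `Day` column plays the role of `anio`.

```r
# Base R sketch: keep complete rows, then drop one column.
# `airquality` ships with R and is a stand-in for `pinguinos`.
raw_data <- airquality

# complete.cases() is TRUE for rows that contain no NA at all
complete_rows <- complete.cases(raw_data)
clean_data <- raw_data[complete_rows, ]

# drop a column by name (`Day` here, playing the role of `anio`)
clean_data <- clean_data[, setdiff(names(clean_data), "Day")]

nrow(raw_data)     # rows before cleaning
nrow(clean_data)   # rows after cleaning
anyNA(clean_data)  # FALSE once incomplete rows are gone
```

Dropping every incomplete row is a simple choice that works here because few rows are affected (344 before, 333 after).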

---
## Exploring the data visually

```r
pinguinos_db %>% 
  ggplot(aes(x = largo_aleta_mm, y = masa_corporal_g,
             color = sexo, size = masa_corporal_g)) +
  geom_point(alpha = 0.5)
```

---
## Male vs. female differences by species

```r
pinguinos_db %>% 
  ggplot(aes(x = sexo, y = masa_corporal_g, fill = sexo)) + 
  geom_boxplot() +
  facet_wrap(~especie) +
  geom_signif(comparisons = list(c("macho", "hembra")), 
              map_signif_level = TRUE,
              test = "t.test")
```

<img src="tallertidymodels_files/figure-html/unnamed-chunk-7-1.png" width="350px" style="display: block; margin: auto;" />
---
## Correlation between numeric variables

```r
pinguinos_db %>% 
  select(-especie,-sexo,-isla) %>% 
  corrr::correlate() 
```

```
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
```

```
## # A tibble: 4 x 5
##   term            largo_pico_mm alto_pico_mm largo_aleta_mm masa_corporal_g
##   <chr>                   <dbl>        <dbl>          <dbl>           <dbl>
## 1 largo_pico_mm          NA           -0.229          0.653           0.589
## 2 alto_pico_mm           -0.229       NA             -0.578          -0.472
## 3 largo_aleta_mm          0.653       -0.578         NA               0.873
## 4 masa_corporal_g         0.589       -0.472          0.873          NA
```
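
As a rough cross-check, a similar table can be built in base R. The sketch below is an approximation of what `correlate()` reports, not its implementation: it uses the built-in `iris` data as a stand-in and applies the method and missing-value handling named in the message above.

```r
# Base R sketch of a Pearson correlation table with NA on the diagonal.
# `iris` ships with R and stands in for the numeric penguin columns.
numeric_cols <- iris[, sapply(iris, is.numeric)]

cor_matrix <- cor(numeric_cols,
                  method = "pearson",
                  use = "pairwise.complete.obs")

# correlate() shows NA for a variable against itself; mimic that here
diag(cor_matrix) <- NA

round(cor_matrix, 3)
```

Blanking the diagonal keeps the trivial correlation of 1 from distracting from the pairs that matter.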
---

# Correlation between numeric variables

```r
pinguinos_db %>% 
  select(-especie, -sexo, -isla) %>% 
  corrr::correlate() %>% 
  rearrange() %>%  # group highly correlated variables together
  shave() %>%      # blank out the redundant upper triangle
  fashion()        # format the table for printing
```

```
##              term largo_aleta_mm masa_corporal_g largo_pico_mm alto_pico_mm
## 1  largo_aleta_mm                                                          
## 2 masa_corporal_g            .87                                           
## 3   largo_pico_mm            .65             .59                           
## 4    alto_pico_mm           -.58            -.47          -.23
```
---

# Correlation between numeric variables

```r
pinguinos_db %>% 
  select(-especie,-sexo,-isla) %>% 
  corrr::correlate() %>% 
  network_plot()
```

<img src="tallertidymodels_files/figure-html/unnamed-chunk-10-1.png" width="350px" style="display: block; margin: auto;" />
---
## Goal
* Predict a penguin's body mass from its physical characteristics
* Interpret the results we obtain
---
class: inverse, center, middle
# Step one: split the dataset into training and testing sets

https://github.com/rstudio/hex-stickers/blob/master/thumbs/rsample.png

---
## Split the dataset into 80% training and 20% testing

```r
set.seed(1234)
division      <-  initial_split(data = pinguinos_db, prop = .8)
entrenamiento <-  training(division)
testeo        <-  testing(division)
```

```r
nrow(entrenamiento)
```

```
## [1] 267
```

```r
nrow(testeo)
```

```
## [1] 66
```
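
To make the idea of a random split concrete, here is a minimal base R sketch. It is an illustration only and not how rsample is implemented: `mtcars` stands in for `pinguinos_db`, and the `floor()` rounding is an assumption that may differ from the rounding `initial_split()` uses.

```r
# Base R sketch of an 80/20 random split (illustration only).
set.seed(1234)

n_rows    <- nrow(mtcars)          # `mtcars` stands in for pinguinos_db
n_train   <- floor(0.8 * n_rows)   # assumed rounding rule for the training size
train_idx <- sample(seq_len(n_rows), size = n_train)

train_set <- mtcars[train_idx, ]   # rows sampled for training
test_set  <- mtcars[-train_idx, ]  # all remaining rows

# every row lands in exactly one of the two sets
nrow(train_set) + nrow(test_set) == n_rows
```

Setting the seed first is what makes the split reproducible from one run to the next.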
---
class: inverse, center, middle
# Step two: create a recipe

https://github.com/rstudio/hex-stickers/blob/master/thumbs/recipes.png

---
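## Create a recipe to prepare our variables

```r
masa_recipe <- recipe(masa_corporal_g ~ ., data = entrenamiento) %>% 
               step_corr(all_numeric(), -all_outcomes()) %>% # drop highly correlated predictors, never the outcome
               step_dummy(all_nominal()) %>%                 # turn factors into 0/1 columns
               prep()                                        # estimate the steps on the training data

masa_recipe
```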

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          6
## 
## Training data contained 267 data points and no missing data.
## 
## Operations:
## 
## Correlation filter removed no terms [trained]
## Dummy variables from especie, isla, sexo [trained]
```
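
To see what dummy (indicator) columns look like, here is a small base R sketch using `model.matrix()`. The toy data frame is made up for illustration. Base R also names the columns differently (for example `especieBarbijo`), whereas the recipe output in this deck uses an underscore (`especie_Barbijo`).

```r
# Base R sketch of dummy encoding with model.matrix().
# The tiny data frame below is invented for illustration.
toy <- data.frame(
  especie = factor(c("Adelia", "Barbijo", "Papúa", "Adelia")),
  sexo    = factor(c("hembra", "macho", "macho", "hembra"))
)

# One 0/1 column per non-reference level; [, -1] drops the intercept column
dummies <- model.matrix(~ especie + sexo, data = toy)[, -1]
dummies
```

The first level of each factor (`Adelia`, `hembra`) becomes the reference category, so a row of all zeros means an Adelia female.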

---
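## Create a recipe to prepare our variables

```r
# juice() returns the preprocessed training data stored in the prepped recipe
entrenamiento_juice <- masa_recipe %>%
                       juice()

# bake() applies the same trained steps to the test data
testeo_bake         <- masa_recipe %>%
                       bake(testeo)
```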
---
## Create a recipe to prepare our variables

```r
head(entrenamiento_juice, 3)
```

```
## # A tibble: 3 x 9
##   largo_pico_mm alto_pico_mm largo_aleta_mm masa_corporal_g especie_Barbijo
##           <dbl>        <dbl>          <int>           <int>           <dbl>
## 1          39.1         18.7            181            3750               0
## 2          39.5         17.4            186            3800               0
## 3          40.3         18              195            3250               0
## # … with 4 more variables: especie_Papúa <dbl>, isla_Dream <dbl>,
## #   isla_Torgersen <dbl>, sexo_macho <dbl>
```

```r
head(testeo_bake, 3)        
```

```
## # A tibble: 3 x 9
##   largo_pico_mm alto_pico_mm largo_aleta_mm masa_corporal_g especie_Barbijo
##           <dbl>        <dbl>          <int>           <int>           <dbl>
## 1          36.7         19.3            193            3450               0
## 2          34.6         21.1            198            4400               0
## 3          38.2         18.1            185            3950               0
## # … with 4 more variables: especie_Papúa <dbl>, isla_Dream <dbl>,
## #   isla_Torgersen <dbl>, sexo_macho <dbl>
```
---

class: inverse, center, middle
# Step three: use the recipe and train our model

https://github.com/rstudio/hex-stickers/blob/master/thumbs/parsnip.png

---
## Create our model

```r
modelo_lineal <- linear_reg() %>% 
                 set_engine("lm") %>% 
                 set_mode("regression")

translate(modelo_lineal)
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm 
## 
## Model fit template:
## stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
```

---
class: inverse, center, middle
# I'm lost: so many objects and functions

---
## Let's recap
* Data:

```r
entrenamiento # 80%
testeo        # 20%
```
* Recipe: we created dummy variables

```r
masa_recipe
```
* New datasets

```r
entrenamiento_juice
testeo_bake
```
* We created a linear model

```r
modelo_lineal
```

---

```r
# fit the linear model on the preprocessed training data
ml_ajuste <- modelo_lineal %>%
             fit(masa_corporal_g ~ ., data = entrenamiento_juice)

ml_ajuste
```

```
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = masa_corporal_g ~ ., data = data)
## 
## Coefficients:
##     (Intercept)    largo_pico_mm     alto_pico_mm   largo_aleta_mm  
##        -1108.40            23.10            53.43            14.47  
## especie_Barbijo    especie_Papúa       isla_Dream   isla_Torgersen  
##         -282.63           966.68            14.13           -25.70  
##      sexo_macho  
##          378.16
```

---

```r
# predict on the preprocessed test data and keep the original columns alongside
lm_prediccion <- ml_ajuste %>%
                 predict(testeo_bake) %>%
                 bind_cols(testeo_bake)

lm_prediccion
```

```
## # A tibble: 66 x 10
##    .pred largo_pico_mm alto_pico_mm largo_aleta_mm masa_corporal_g
##    <dbl>         <dbl>        <dbl>          <int>           <int>
##  1 3537.          36.7         19.3            193            3450
##  2 4035.          34.6         21.1            198            4400
##  3 3796.          38.2         18.1            185            3950
##  4 3849.          40.6         18.6            183            3550
##  5 3344.          36.5         18              182            3150
##  6 3340.          37           16.9            185            3000
##  7 3952.          39.6         18.8            190            4600
##  8 3361.          34.5         18.1            187            2900
##  9 3345.          37.6         17              185            3600
## 10 3970.          41.6         18              192            3950
## # … with 56 more rows, and 5 more variables: especie_Barbijo <dbl>,
## #   especie_Papúa <dbl>, isla_Dream <dbl>, isla_Torgersen <dbl>,
## #   sexo_macho <dbl>
```
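
To help interpret the fit, the first prediction can be rebuilt by hand from the printed coefficients. The sketch below copies the rounded coefficients and the three measurements shown in the output. The dummy values are an assumption: they suppose the first test penguin is the Adelia female from Torgersen that appears in the earlier `glimpse()` output. Because of rounding, the result should be close to the first `.pred` value rather than identical.

```r
# Rebuild one prediction by hand from the rounded coefficients
# printed for ml_ajuste (accent dropped in especie_Papua to keep names ASCII).
coefs <- c(
  intercept       = -1108.40,
  largo_pico_mm   = 23.10,
  alto_pico_mm    = 53.43,
  largo_aleta_mm  = 14.47,
  especie_Barbijo = -282.63,
  especie_Papua   = 966.68,
  isla_Dream      = 14.13,
  isla_Torgersen  = -25.70,
  sexo_macho      = 378.16
)

# First test penguin. Assumption: Adelia female from Torgersen,
# so isla_Torgersen is the only dummy column set to 1.
penguin <- c(
  intercept       = 1,
  largo_pico_mm   = 36.7,
  alto_pico_mm    = 19.3,
  largo_aleta_mm  = 193,
  especie_Barbijo = 0,
  especie_Papua   = 0,
  isla_Dream      = 0,
  isla_Torgersen  = 1,
  sexo_macho      = 0
)

# A linear model prediction is the sum of coefficient * value
pred_by_hand <- sum(coefs * penguin[names(coefs)])
pred_by_hand
```

Read this way, each coefficient is the change in predicted body mass (in grams) for a one-unit change in that variable, holding the others fixed. For example, `sexo_macho` adds about 378 g relative to a female with the same measurements.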
---
class: inverse, center, middle
# Yardstick: evaluating the model

https://github.com/rstudio/hex-stickers/blob/master/thumbs/yardstick.png?raw=true

---

```r
lm_prediccion %>% metrics(truth = masa_corporal_g, estimate = .pred)
```

```
## # A tibble: 3 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     294.   
## 2 rsq     standard       0.873
## 3 mae     standard     237.
```
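
For intuition about these three numbers, here is a base R sketch of how such metrics are usually computed. It uses only the first ten truth/prediction pairs printed in `lm_prediccion` (rounded), so its values will differ from the table above, which covers all 66 test rows. Treating `rsq` as the squared correlation is my understanding of yardstick's definition, stated here as an assumption.

```r
# Base R sketch of RMSE, MAE and R squared on ten rows copied
# from the printed lm_prediccion output (rounded values).
truth <- c(3450, 4400, 3950, 3550, 3150, 3000, 4600, 2900, 3600, 3950)
pred  <- c(3537, 4035, 3796, 3849, 3344, 3340, 3952, 3361, 3345, 3970)

errors <- truth - pred

rmse <- sqrt(mean(errors^2))  # root mean squared error, in grams
mae  <- mean(abs(errors))     # mean absolute error, in grams
rsq  <- cor(truth, pred)^2    # squared correlation, unitless (assumed definition)

c(rmse = rmse, mae = mae, rsq = rsq)
```

RMSE penalizes large misses more heavily than MAE, so it can never be smaller than MAE. A wide gap between the two suggests a few large errors.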

---

```r
lm_prediccion %>%  ggplot(aes(x=masa_corporal_g, y=.pred,
                               color=masa_corporal_g)) +
                   geom_point(alpha=0.5) +
                   geom_abline() +
                   coord_equal() +
                   ylim(c(2000,6000)) +
                   xlim(c(2000,6000)) 
```

<img src="tallertidymodels_files/figure-html/unnamed-chunk-25-1.png" width="450px" style="display: block; margin: auto;" />
---
# More information, code, and workshops
* [Tidymodels.org](https://www.tidymodels.org/)
* [Latin R](https://github.com/tidymodels-latam-workshops/latinR2020)
* [Linear and Bayesian Regression Models with tidymodels package, Masumbuko Semba](https://semba-blog.netlify.app/05/11/2020/regression-with-tidymodels/)
* [Tidymodel and glmnet, Jun Kang](https://www.jkangpathology.com/post/tidymodel-and-glmnet/)
* [Julia Silge's YouTube channel](https://www.youtube.com/channel/UCTTBgWyJl2HrrhQOOc710kA)
---