In the last two assignments, you used a set of variables to predict an outcome. Now, try to see whether there are any clusters among those variables. That is, use the clustering techniques we talked about in class to try to uncover any structure underlying the predictors in your model. Describe what you find. This assignment is due December 22, 2020.

I use an GLRM to make a political profile of some variables that showed up as relevant in previous assignments. These variables are categorical, so when used in analysis, they increased the dimensionality of the data. These variables are related, so it would make sense to cluster them together.

## Importing data----
getwd()
anes16<-import("./ML/ANES/raw/anes_timeseries_2016_dta/anes_timeseries_2016.dta")

d1 <- anes16 %>%
  select(V161007, V161267x,V161342,V161310x,V161270,V161010d,
         V161268, V161115, V161217:V161223, V161326:V161329,
         V161522, ends_with('x')) %>% 
  rename(internet= V161007,
         int_home = V161326,
         age     = V161267x,
         gender  = V161342,
         race    = V161310x,
         education = V161270,
         region = V161010d,
         income = V161361x,
         trust_peple = V161219,
         gov_waste = V161217,
         unemp = V161142x,
         marital = V161268,
         Pres_for_rel = V161084x,
         occup = V161276x) %>% 
  # remove_noresponse() %>%
  mutate( internet = case_when(
    internet == 2 ~ 0,
    internet == 1 ~ 1),
    int_home = case_when(
      int_home == 2 ~ 0,
      int_home == 1 ~ 1)
  ) %>% 
  # filter(!is.na(int_home)) %>%
  filter(!is.na(internet)) %>%
  # drop_na(internet) %>% #I could have also used
  factorize() #%>%
# reduce_fct_levels()


# Wrangling data ----------------------------------------------------------

library(recipes)
blueprint <- recipe(~ ., data=d1) %>%
  step_nzv(all_predictors())  %>% 
  step_knnimpute(all_numeric(), neighbors = 6) %>%
  step_other(all_nominal(), threshold = 0.01,
             other = "other") %>%
  step_unknown(all_nominal()) %>%
  step_YeoJohnson(all_numeric()) %>% 
  step_scale(all_numeric()) %>%
  step_dummy(all_nominal(), one_hot = FALSE)
prep_data <- recipes::prep(blueprint, d1)
tmp_data <- bake(prep_data, 
                 new_data=d1)


# Select variables to cluster
pol_prof <- tmp_data %>%
  select(starts_with(c('trust_peple',
                       'gov_waste',
                       'Pres_for_re'))) #%>%
  # as.matrix()
library(h2o)
h2o.no_progress()  # turn off progress bars
h2o.init(max_mem_size = "5g")  # connect to H2O instance
pp_h <- as.h2o(pol_prof)
split <- h2o.splitFrame(pp_h, ratios = 0.6, seed = 123)
train <- split[[1]]
valid <- split[[2]]
# Create hyperparameter search grid
params <- expand.grid(
  regularization_x = c("None", "NonNegative", "L1"),
  regularization_y = c("None", "NonNegative", "L1"),
  gamma_x = seq(0, 1, by = .25),
  gamma_y = seq(0, 1, by = .25),
  error = 0,
  stringsAsFactors = FALSE
)

for(i in seq_len(nrow(params))) {
  # Create model
  glrm_pp <- h2o.glrm(
    training_frame = train,
    k = 2, 
    loss = "Quadratic",
    regularization_x = params$regularization_x[i], 
    regularization_y = params$regularization_y[i],
    gamma_x = params$gamma_x[i],
    gamma_y = params$gamma_y[i],
    transform = "STANDARDIZE", 
    max_runtime_secs = 1000,
    seed = 333
  )
  # Predict on validation set and extract error
  validate <- h2o.performance(glrm_pp, valid)
  params$error[i] <- validate@metrics$numerr
}

params %>% arrange(error) %>% head(10)

glrm_pp <- h2o.glrm(
  training_frame = pp_h,
  k = 2, 
  loss = "Quadratic",
  regularization_x = "None", 
  regularization_y = "NonNegative",
  gamma_x = 0,
  gamma_y = 0,
  transform = "STANDARDIZE", 
  max_runtime_secs = 1000,
  seed = 333
)

glrm_pp@model$importance

labels <- colnames(glrm_pp@model$archetypes)
t(glrm_pp@model$archetypes) %>% 
  as_tibble() %>% 
  mutate(label = labels) %>% 
  arrange(desc(Arch1),desc(Arch2))


p1 <- t(glrm_pp@model$archetypes) %>% 
  as.data.frame() %>% 
  mutate(feature = row.names(.)) %>%
  ggplot(aes(Arch1, reorder(feature, Arch1))) +
  geom_point()

p2 <- t(glrm_pp@model$archetypes) %>% 
  as.data.frame() %>% 
  mutate(feature = row.names(.)) %>%
  ggplot(aes(Arch2, reorder(feature, Arch2))) +
  geom_point()


gridExtra::grid.arrange(p1, p2, nrow = 1)
regularization_x regularization_y gamma_x gamma_y error
None NonNegative 0.00 0 4970.378
NonNegative NonNegative 0.00 0 4970.378
L1 NonNegative 0.00 0 4970.378
None NonNegative 0.25 0 4970.378
NonNegative NonNegative 0.25 0 4970.378
L1 NonNegative 0.25 0 4970.378
None NonNegative 0.50 0 4970.378
NonNegative NonNegative 0.50 0 4970.378
L1 NonNegative 0.50 0 4970.378
None NonNegative 0.75 0 4970.378
pc1 pc2
Standard deviation 1.0383576 0.7750300
Proportion of Variance 0.0898489 0.0500560
Cumulative Proportion 0.0898489 0.1399048

Archetypes of GLRM

It seems that the Archetypes are capturing two types of people. On one hand, they disapprove how the president was handling foreign relationships, consider government is wasting money somewhat, and trust people half and most the time. On the other, there’s people that think government does not waste very much, but never trusts people.

Nevertheless, in terms of the variation, the archetypes are not capturing a lot of variation.