Final Project

Mei Lin Verghese Law Kung Sam (s194685), Thelma Urena (s242552), Alejandra Caballero (s231912), Marco Bonafede (s243247), Carlos Sainz (s253695)

Introduction

Our project is based on the GEO dataset GSE54514, accessed with the GEOquery package

And this paper

Materials and Methods

The raw data:

dimension = 1467 x 2

Number of samples = 163

But a tidy data set should follow these rules:

Every column contains exactly one variable.
Every row represents exactly one experimental unit (sample).
Every cell contains exactly one measurement.
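As a rough illustration of how these rules can be checked, the sketch below uses a made-up two-column table (the names `meta_toy`, `sample` and `characteristics` and all values are hypothetical, not taken from GSE54514). It flags a table where one sample spans several rows and where one column mixes a variable name with its value.

```r
library(dplyr)
library(tibble)

# Hypothetical toy metadata; values are invented for illustration only
meta_toy <- tribble(
  ~sample, ~characteristics,
  "S1",    "age: 60",
  "S1",    "sex: F",
  "S2",    "age: 45",
  "S2",    "sex: M"
)

# Rule 2 is violated if any sample occupies more than one row
rows_per_sample <- meta_toy |>
  count(sample, name = "n_rows")
violates_rule_2 <- any(rows_per_sample$n_rows > 1)

# Rule 1 is violated here because every cell packs "variable: value" together
violates_rule_1 <- all(grepl(": ", meta_toy$characteristics, fixed = TRUE))
```

Both flags come out `TRUE` for this toy table, which is the kind of shape that needs separating and pivoting before analysis.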

Materials and Methods

Merge annotation_data (gene names) with expression_data

merged_data <- expression_data |>
  left_join(annotation_data, by = "ILMN_ID")

Keep only the genes of interest

merged_data_clean <- merged_data |>
  dplyr::filter(Gene_ID %in% gene_list)

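If a gene has more than one ILMN_ID (probe), average its expression across probes

clean_data <- merged_data_clean |>
  group_by(Gene_ID) |>
  # An anonymous function replaces passing na.rm through across(), which is deprecated in dplyr >= 1.1
  summarise(across(starts_with("GSM"), \(x) mean(x, na.rm = TRUE))) |>
  ungroup()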
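A minimal sketch of the averaging step on invented numbers may make the behaviour clearer. The table `probes_toy` and its values are hypothetical; only the column naming pattern (`ILMN_ID`, `Gene_ID`, `GSM...`) follows the slides.

```r
library(dplyr)
library(tibble)

# Hypothetical probe-level table: geneX is measured by two probes
probes_toy <- tribble(
  ~ILMN_ID, ~Gene_ID, ~GSM_A, ~GSM_B,
  "ILMN_1", "geneX",  6,      8,
  "ILMN_2", "geneX",  8,      NA,
  "ILMN_3", "geneY",  5,      5
)

# One row per gene; missing probe values are ignored when averaging
genes_toy <- probes_toy |>
  group_by(Gene_ID) |>
  summarise(across(starts_with("GSM"), \(x) mean(x, na.rm = TRUE)))
```

Here geneX ends up with 7 for `GSM_A` (mean of 6 and 8) and 8 for `GSM_B` (the `NA` is dropped), and the `ILMN_ID` column disappears because it is neither a grouping nor a summarised column.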

Pivot to long format and merge with the metadata by sample

second_clean <- clean_data |>
  pivot_longer(
    cols = starts_with("GSM"),
    names_to = "sample",
    values_to = "expression"
  )

merged_final <- second_clean |>
  left_join(meta_clean, by = "sample")

dimension = 2445 x 11
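A quick sanity check on the row count: 2445 = 163 x 15, which would be consistent with 15 genes of interest measured in each of the 163 samples (the number of genes is an inference from the arithmetic, not something stated here). The sketch below shows the same check on hypothetical toy tables.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical gene-level expression and sample metadata
clean_toy <- tribble(
  ~Gene_ID, ~GSM_A, ~GSM_B, ~GSM_C,
  "geneX",  7,      8,      6,
  "geneY",  5,      5,      4
)
meta_toy <- tribble(
  ~sample, ~day,
  "GSM_A", "D1",
  "GSM_B", "D2",
  "GSM_C", "D1"
)

final_toy <- clean_toy |>
  pivot_longer(cols = starts_with("GSM"),
               names_to = "sample",
               values_to = "expression") |>
  left_join(meta_toy, by = "sample")

# After pivoting, each gene-sample pair should appear exactly once
expected_rows <- n_distinct(final_toy$Gene_ID) * n_distinct(final_toy$sample)
rows_match <- nrow(final_toy) == expected_rows
```

If `rows_match` were `FALSE` on the real data, a duplicated sample in the metadata would be one likely cause, since `left_join()` repeats rows for every match.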

Materials and Methods

Separate characteristics into key/value pairs

meta_separated <- meta_untidy |>
  separate(characteristics, into = c("key", "value"), sep = ": ")

Convert long format to wide format

meta_wide <- meta_separated |>
  pivot_wider(names_from = key, values_from = value)
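The two steps above can be sketched on a small made-up table to show the shape change. The table `meta_toy` and its values are hypothetical; only the `characteristics` and `group_day` names come from the slides.

```r
library(tidyr)
library(tibble)

# Hypothetical long-format metadata: one "key: value" string per row
meta_toy <- tribble(
  ~sample, ~characteristics,
  "S1",    "age: 60",
  "S1",    "group_day: A_D1",
  "S2",    "age: 45",
  "S2",    "group_day: B_D2"
)

meta_toy_wide <- meta_toy |>
  # Split each string at ": " into the variable name and its value
  separate(characteristics, into = c("key", "value"), sep = ": ") |>
  # Spread the keys into columns, giving one row per sample
  pivot_wider(names_from = key, values_from = value)
```

The result has one row per sample with columns `sample`, `age` and `group_day`. Note that all values stay character after this step, so numeric variables such as `age` would still need converting.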

Replace string "NA" with actual NA

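meta_wide <- meta_wide |>
  # mutate_all() is superseded; across() with na_if() does the same for character columns
  mutate(across(where(is.character), \(x) na_if(x, "NA")))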
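A small sketch of why this step matters, using an invented table (`meta_toy` and the `severity` column are hypothetical): the string "NA" is an ordinary value as far as R is concerned, so `is.na()` does not detect it until it is converted.

```r
library(dplyr)
library(tibble)

# Hypothetical metadata where a missing value was stored as the text "NA"
meta_toy <- tibble(
  sample   = c("S1", "S2"),
  severity = c("NA", "high")
)

missing_before <- sum(is.na(meta_toy$severity))   # the text "NA" is not counted

meta_fixed <- meta_toy |>
  mutate(across(where(is.character), \(x) na_if(x, "NA")))

missing_after <- sum(is.na(meta_fixed$severity))  # now counted as missing
```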

Split group_day into group and day, keeping only day

meta_clean <- meta_wide |>
  separate(group_day, into = c("group_temp", "day"), sep = "_", remove = TRUE) |>
  select(-group_temp)

Materials and Methods

Flag which samples are in the control group

pheno_data_control <- pheno_data |>
  # Note: "(Control)" is a regex group, so this matches "Control" anywhere in the title
  mutate(control = str_detect(title, "(Control)")) |>
  mutate(control = factor(control, levels = c(TRUE, FALSE))) |>
  group_by(control)
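One detail worth illustrating: in a regular expression, parentheses form a group rather than matching literal brackets. The sketch below compares the two readings on made-up titles (`titles_toy` is hypothetical; the real titles in the phenotype data may look different).

```r
library(stringr)

# Hypothetical sample titles for illustration only
titles_toy <- c("patient 1 (Control)", "Control sample", "patient 2 (Sepsis)")

# Regex reading: the parentheses group the word, so any "Control" matches
regex_match <- str_detect(titles_toy, "(Control)")

# Literal reading: fixed() requires the brackets to be present in the title
literal_match <- str_detect(titles_toy, fixed("(Control)"))
```

On these titles `regex_match` is `TRUE, TRUE, FALSE` while `literal_match` is `TRUE, FALSE, FALSE`. Which one is wanted depends on how the titles are actually written.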

Merge phenotype data, expression data and platform data

merged_pheno_control <- merged_unfiltered |> 
  left_join(y = pheno_data_control,
            by = join_by(sample == geo_accession))

Results and Discussion

Boxplots for expression per gene in survivors vs non-survivors
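A plot of this kind could be produced roughly as follows. This is only a sketch on invented data: the table `plot_toy`, the `outcome` column name and its labels are assumptions, and the styling of the actual figure may differ.

```r
library(ggplot2)
library(tibble)

# Hypothetical long-format data: one row per gene and sample
plot_toy <- tibble(
  Gene_ID    = rep(c("geneX", "geneY"), each = 6),
  outcome    = rep(c("survivor", "non-survivor"), times = 6),
  expression = c(7.1, 6.2, 7.4, 6.0, 6.9, 6.5,
                 9.0, 9.8, 8.7, 9.9, 9.1, 10.2)
)

# One panel per gene, one box per outcome group
p <- ggplot(plot_toy, aes(x = outcome, y = expression, fill = outcome)) +
  geom_boxplot() +
  facet_wrap(~ Gene_ID, scales = "free_y") +
  labs(x = NULL, y = "Expression")
```

Free y-axes per panel keep genes with very different expression levels readable side by side.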

Results and Discussion

Difference in day 1 gene expression in survivors vs non-survivors
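One simple way to express such a difference is the gap in mean expression per gene between the two groups, restricted to day 1. The sketch below uses invented data; the `outcome` column, the `"D1"` label and the choice of the mean as summary are all assumptions and not necessarily the method used for the slide.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format expression data with day and outcome per sample
expr_toy <- tribble(
  ~Gene_ID, ~sample, ~day, ~outcome,       ~expression,
  "geneX",  "S1",    "D1", "survivor",     7,
  "geneX",  "S2",    "D1", "non_survivor", 5,
  "geneX",  "S3",    "D2", "survivor",     9,
  "geneY",  "S1",    "D1", "survivor",     4,
  "geneY",  "S2",    "D1", "non_survivor", 6,
  "geneY",  "S3",    "D2", "survivor",     1
)

day1_diff <- expr_toy |>
  filter(day == "D1") |>                                   # keep day 1 only
  group_by(Gene_ID, outcome) |>
  summarise(mean_expression = mean(expression), .groups = "drop") |>
  pivot_wider(names_from = outcome, values_from = mean_expression) |>
  mutate(difference = survivor - non_survivor)             # positive = higher in survivors
```

A difference in means alone says nothing about significance; a formal test would be a separate step.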