Lecture Lab 1

Leon Eyrich Jessen

Course Introduction

DATADATADATA: Data Hoarding

Increasing the Value of Data Requires Activation!

Data Driven Decision Making

Because we’ve always done it this way!

Your job as a Bio Data Scientist:

  • Activate data
  • Extract insights
  • Communicate to non-data stakeholders
  • Facilitate data driven decision making

Leveraging data driven decision making allows the company to gain a competitive edge, and this is where your Bio Data Science skills are indispensable!

Your value as a Bio Data Scientist / Bioinformatician

  • In your career, your task will be to create value

  • This is regardless of whether you plan to work in industry or pursue a research career

  • What you do has to create value

  • Creating value requires skills

  • Skills need to be learned

  • So, why are you here?

You’re here to gain skills, which will allow you to generate value!

R for Bio Data Science - What is it?

  • In essence: The art of converting numbers to value
    • Ingest data
    • Transform, wrangle, visualise, model
    • Output insights
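
As a rough illustration of these steps, here is a minimal base R sketch. It uses the built-in ToothGrowth data set as a stand-in for data you would normally read from a file, and the object names (raw_data, dose_means, fit) are made up for this example:

```r
# Ingest: built-in data stands in for e.g. a file read with read.csv()
raw_data <- ToothGrowth

# Transform / wrangle: mean tooth length per vitamin C dose
dose_means <- aggregate(len ~ dose, data = raw_data, FUN = mean)

# Visualise: one option would be boxplot(len ~ dose, data = raw_data)

# Model: does tooth length change with dose?
fit <- lm(len ~ dose, data = raw_data)

# Output insights: the summary table and the estimated effect of dose
dose_means
coef(fit)[["dose"]]
```

In this course the same steps will be carried out with tidyverse tools rather than base R.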

R for Bio Data Science - Intrinsically interdisciplinary

Why “Bio” in R for Bio Data Science?

What is the motivation for this Course?

What will you learn?

  • The craft of going from raw extracted data to insights
  • Advanced data visualisation
  • Collaborative project oriented coding
  • All with an emphasis on reproducibility and communication

R

Introducing R: A Journey into Bio Data Science

  • Open-source programming language
  • Essential tool for statistics & data visualization
  • Widely used in bioinformatics and data science
  • Dynamic community & vast library of packages

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.” – John Chambers (creator of the S language, of which R is an implementation).
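
The two slogans can be checked directly in an R console. The following is a minimal base R sketch; the variable names are only for illustration:

```r
# Everything that exists is an object: even functions have a class
class(mean)        # "function"
class(`+`)         # "function"

# Everything that happens is a function call: operators are functions too
x <- c(1, 2, 3)
infix_sum  <- x + 1          # the familiar infix notation
prefix_sum <- `+`(x, 1)      # the same call written in prefix form
identical(infix_sum, prefix_sum)         # TRUE

# Even subsetting is a function call
second_infix  <- x[2]
second_prefix <- `[`(x, 2)
identical(second_infix, second_prefix)   # TRUE
```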

The Roots and Rise of R

  • Originated from the ‘S’ language at Bell Laboratories in the 1970s
  • S was proprietary, so R is basically an open source implementation of S; it was officially released in 1995
  • This is similar to Linux vs. Unix
  • A leader in statistical computing. Powers many academic research & industry projects
  • E.g. Crucial in genomics, where R aids in decoding biological data
  • R comes with a very large and well-proven set of built-in tools for data analysis

A Few Examples of Functional Programming

You can approach R as

  • an object-oriented programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Object Oriented Approach:

Vector <- R6::R6Class("Vector",
  public = list(
    data = NULL,
    initialize = function(data) {
      if (!is.numeric(data)) {
        stop("Data should be numeric.")
      }
      self$data <- data
    },
    mean = function() {
      return(sum(self$data) / length(self$data))
    }
  )
)
numbers <- Vector$new(my_vector)
print(numbers$mean())
[1] 38

A Few Examples of Functional Programming

You can approach R as

  • an object-oriented programming language
  • an imperative programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Imperative Approach:

# Accumulate the sum one element at a time
my_sum <- 0
for (i in seq_along(my_vector)) {
  my_sum <- my_sum + my_vector[i]
}
# Divide by the number of elements to get the mean
my_mean <- my_sum / length(my_vector)
print(my_mean)
[1] 38

A Few Examples of Functional Programming

You can approach R as

  • an object-oriented programming language
  • an imperative programming language
  • a functional programming language

The three code examples all perform the same task, but which do you think is:

  • simpler to read and understand?
  • faster to execute?

In this course we will work with R in its native form - a fully fledged functional programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Functional approach:

my_mean <- mean(my_vector)
print(my_mean)
[1] 38
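
To get a rough feel for the speed question, one option is to time the imperative loop against the built-in mean() on a larger vector. This is a sketch only: the vector big_vector and its size are made up for illustration, and timings will vary between machines.

```r
# A larger, made-up vector so that the difference becomes measurable
set.seed(1)
big_vector <- runif(1e6)

# Imperative approach: accumulate the sum in an explicit loop
loop_time <- system.time({
  my_sum <- 0
  for (i in seq_along(big_vector)) {
    my_sum <- my_sum + big_vector[i]
  }
  loop_mean <- my_sum / length(big_vector)
})

# Functional approach: a single call to a built-in function
call_time <- system.time({
  call_mean <- mean(big_vector)
})

# Both approaches agree (up to floating point rounding)
all.equal(loop_mean, call_mean)

# Compare elapsed times in seconds
loop_time[["elapsed"]]
call_time[["elapsed"]]
```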

You simply call functions on objects

Standard Deviation

sd(my_vector)
[1] 19.70899

Median

median(my_vector)
[1] 35.5

Permute

sample(my_vector)
 [1]  7 23 31 35 37 24 49 71 36 67

Bootstrap

sample(my_vector, replace = TRUE)
 [1] 67 31 49 31 24 31 67 36 37 37

…and tons more!
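
Because each of these is just a function call, they compose easily. As a sketch, the bootstrap resampling above can be repeated many times to estimate the uncertainty of the mean; the number of resamples (1000) and the seed are arbitrary choices for illustration:

```r
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

# Fix the random seed so the resampling is reproducible
set.seed(42)

# Draw 1000 bootstrap samples and compute the mean of each
boot_means <- replicate(1000, mean(sample(my_vector, replace = TRUE)))

# The spread of the bootstrap means estimates the standard error of the mean
boot_se <- sd(boot_means)

# A simple 95% percentile interval for the mean
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

boot_se
boot_ci
```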

“R is not a real programming language”: Debunking Myths I

  1. R is Turing-complete:
    • R can theoretically solve any computational problem. Foundational concept shared with e.g. Python, C++, Java, etc.
  2. R is fully capable in Production:
    • E.g. Shiny apps are used in industry, and R comes with an ecosystem supporting reproducibility in production settings.
  3. Comprehensive Ecosystem:
    • CRAN contains ~20,000 packages. Also Bioconductor is a gold standard for bioinformatics software.
  4. Interoperability:
    • Seamless integration with other languages (C, C++, Fortran, and Python) using packages like Rcpp and reticulate.

“R is not a real programming language”: Debunking Myths II

  1. Advanced Programming Features:
    • Supports object-oriented, functional, and imperative programming paradigms. Flexible metaprogramming with capabilities like non-standard evaluation
  2. R’s Active & Growing Community:
    • Annual global R conferences, numerous local user groups, and also the tidyverse
  3. Performance:
    • R is interpreted and can be slower, but packages like data.table and Rcpp offer dramatic performance enhancements. Also, parallel computing is straightforward
  4. Not Just for Statisticians:
    • R’s applications range from web development to machine learning (tidymodels, caret, mlr3) to reporting (Quarto, bookdown)
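
Non-standard evaluation, mentioned under metaprogramming above, means that a function can capture the code of its argument instead of its value. A minimal base R sketch, where show_expr and the data frame df are hypothetical names made up for illustration:

```r
# Capture the unevaluated expression passed as an argument
show_expr <- function(x) {
  expr <- substitute(x)          # the code the caller wrote, not its value
  deparse(expr)                  # convert the captured code to a string
}
show_expr(a + b * 2)             # "a + b * 2", even though a and b do not exist

# Build an expression first, evaluate it later inside a data frame
df <- data.frame(height = c(1.6, 1.8), weight = c(60, 80))
bmi_expr <- quote(weight / height^2)
bmi <- eval(bmi_expr, envir = df)   # the columns are looked up in df
bmi
```

This kind of mechanism is what allows bare column names to be used inside function calls.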

Closing Thought: Every tool has its strengths. The key is to understand and leverage them effectively.

Tidyverse

Tidyverse

  • With SO many packages, there will inevitably be SO many opinions

  • The tidyverse is a unified opinionated collection of R packages designed for data science

  • All packages share an underlying design philosophy, grammar, and data structures

  • Today R has in essence become two dialects: base and tidyverse

  • Note: This course will focus solely on the tidyverse dialect

We’ll spend a lot more time on going over the details of the Tidyverse!
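
To make the two dialects concrete, here is a sketch of the same summary written both ways, using the built-in iris data set as an arbitrary example. The tidyverse version assumes that the dplyr package is installed and that R is version 4.1 or later (for the native pipe |>); the requireNamespace() guard is only there so that the sketch also runs without dplyr.

```r
# Base R dialect: mean sepal length per species
base_result <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
base_result

# Tidyverse dialect: the same summary with dplyr verbs and the pipe
if (requireNamespace("dplyr", quietly = TRUE)) {
  tidy_result <- iris |>
    dplyr::group_by(Species) |>
    dplyr::summarise(Sepal.Length = mean(Sepal.Length))
  print(tidy_result)
}
```

Both versions give one row per species with the same means; the difference lies in how the steps are expressed.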

Intermezzo: A brief course History

  • From ~20 to ~150 students

  • This year, materials have been revised to suit large classroom teaching

  • The teaching team will do our best to support your learning, but it is important to emphasise that you will have to take responsibility for following the course curriculum!

General Course Outline

Tuesdays 08.00 - 12.00

  • 08.00 - 08.30 Recap of key points from last week’s exercises
  • 08.30 - 08.45 Introduction to theme of the day
  • 08.45 - 09.00 Break
  • 09.00 - 12.00 Exercises

About

Course Description

  • Basically, what you can expect to learn and what I expect you to learn: DTU Course Base

Course Resources

Course format

  • Active Learning: Very strong emphasis on students working in groups, rather than me talking
  • The focus is on you working actively, not me talking
  • I will not go through all preparation materials in class
  • Proper preparation is a prerequisite for completing lab exercises and maximising course yield
  • I focus on supporting your independence, hence for some exercises you will have to seek out information (I’m not a good data scientist, I’m just slightly better at googling than others)

Exercise feedback

Weekly

  • An exercise question will be highlighted
  • Each group is responsible for crafting an answer to this highlighted question
  • These answers will be hand-ins
  • The following week, we will choose a random answer to be discussed in plenum
  • Note: This starts from lab 2

Group Formation

  • Modern Bio Data Science is a team sport!
  • You have to form a group of 4-5 students with a Shared Bio Data Science / Bioinformatics Area of Interest
  • You will work in these groups throughout the course
  • You will do the final project in these groups
  • You will attend the exam in these groups
  • Group work is a very important meta skill for an engineer!
  • Please fill in groups, see schedule for lab 1
  • If you do not have a group, fill in your id and interest at an available group and someone might join you
  • I aim to let you decide on the groups, but I will of course be happy to help if needed

How to succeed in this course

  • Prepare materials as instructed!
  • Show up for class!
  • Do the exercises!
  • Do the project work!

Basically, show up, follow the curriculum and you will do fine!

Questions?