Lecture Lab 1

Leon Eyrich Jessen

Course Introduction

DATADATADATA: Data Hoarding

Increasing the Value of Data Requires Activation!

Data Driven Decision Making

Because we’ve always done it this way!

Your job as a Bio Data Scientist:

Activate data
Extract insights
Communicate to non-data stakeholders
Facilitate data driven decision making

Leveraging data driven decision making allows the company to gain a competitive edge, and this is where your Bio Data Science skills are indispensable!

Your value as a Bio Data Scientist / Bioinformatician

In your career, your task will be to create value
This is regardless of whether you plan to work in industry or pursue a research career
What you do has to create value
Creating value requires skills
Skills need to be learned
So, why are you here?

You’re here to gain skills, which will allow you to generate value!

R for Bio Data Science - What is it?

In essence: The art of converting numbers to value
- Ingest data
- Transform, wrangle, visualise, model
- Output insights
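
As a minimal sketch of these three steps, here is a base R example on a made-up inline data set. The column names `sample`, `group` and `expression` and all values are purely illustrative, and the course itself will use tidyverse tools for this kind of work:

```r
# Ingest: read a small, made-up CSV from an inline string (hypothetical data)
raw_text <- "sample,group,expression
s1,control,4.1
s2,control,3.9
s3,treated,6.2
s4,treated,5.8"
my_data <- read.csv(text = raw_text)

# Transform: compute the mean expression per group
group_means <- aggregate(expression ~ group, data = my_data, FUN = mean)

# Output: a small summary table that can be communicated
print(group_means)
```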

R for Bio Data Science - Intrinsically interdisciplinary

Why “Bio” in R for Bio Data Science?

What is the motivation for this Course?

What will you learn?

The craft of going from raw extracted data to insights
Advanced data visualisation
Collaborative project oriented coding
All with an emphasis on reproducibility and communication

R

Introducing R: A Journey into Bio Data Science

Open-source programming language
Essential tool for statistics & data visualization
Widely used in bioinformatics and data science
Dynamic community & vast library of packages

“To understand computations in R, two slogans are helpful: Everything that exists is an object. Everything that happens is a function call.” – John Chambers (creator of the S language, of which R is an implementation).
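
The two slogans can be tried directly in the console. A small sketch in base R, where the vector `x` is just an illustration:

```r
# Everything that exists is an object: even functions are objects
class(mean)        # "function"
is.function(`+`)   # TRUE

# Everything that happens is a function call: operators are functions
1 + 2              # the usual infix form
`+`(1, 2)          # the same call written in prefix form

# Subsetting is also a function call
x <- c(10, 20, 30)
x[2]
`[`(x, 2)
```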

The Roots and Rise of R

Originated from the ‘S’ language at Bell Laboratories in the 1970s
S was proprietary, so R is essentially an open-source implementation of S, officially released in 1995
This is similar to Linux vs. Unix
A leader in statistical computing. Powers many academic research & industry projects
E.g. Crucial in genomics, where R aids in decoding biological data
R comes with a very large and well-proven set of built-in tools for data analysis

A Few Examples of Functional Programming

You can approach R as

an object-oriented programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Object Oriented Approach:

Vector <- R6::R6Class("Vector",
  public = list(
    data = NULL,
    initialize = function(data) {
      if (!is.numeric(data)) {
        stop("Data should be numeric.")
      }
      self$data <- data
    },
    mean = function() {
      return(sum(self$data) / length(self$data))
    }
  )
)
numbers <- Vector$new(my_vector)
print(numbers$mean())

[1] 38

A Few Examples of Functional Programming

You can approach R as

an object-oriented programming language
an imperative programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Imperative Approach:

my_sum <- 0
for (i in seq_along(my_vector)) {
  my_sum <- my_sum + my_vector[i]
}
my_mean <- my_sum / length(my_vector)
print(my_mean)

[1] 38

A Few Examples of Functional Programming

You can approach R as

an object-oriented programming language
an imperative programming language
a functional programming language

The three approaches all perform the same task, but which do you think is:

simpler to read and understand?
faster to execute?

In this course we will work with R in its native form - a fully fledged functional programming language

Let’s say we have this vector

my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

Now, we want to compute the mean, we can do:

Functional approach:

my_mean <- mean(my_vector)
print(my_mean)

[1] 38
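
On the "faster to execute?" question, a rough way to check for yourself is to time an explicit loop against the built-in function on a larger vector. This is only a sketch: the vector `big_vector` and the helper `loop_mean()` are made up for illustration, and timings vary by machine, so no numbers are shown here.

```r
# Hypothetical larger vector, so that timing differences become visible
set.seed(1)
big_vector <- runif(1e6)

# Imperative version wrapped in a (hypothetical) helper function
loop_mean <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total / length(x)
}

# Time both approaches; results depend on your machine
system.time(loop_result <- loop_mean(big_vector))
system.time(builtin_result <- mean(big_vector))

# Both give the same answer, up to floating point rounding
all.equal(loop_result, builtin_result)
```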

You simply call functions on objects

Standard Deviation

sd(my_vector)

[1] 19.70899

Median

median(my_vector)

[1] 35.5

Permute

sample(my_vector)

 [1]  7 23 31 35 37 24 49 71 36 67

Bootstrap

sample(my_vector, replace = TRUE)

 [1] 67 31 49 31 24 31 67 36 37 37

…and tons more!
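
Building on the bootstrap one-liner, here is a minimal sketch of how resampling could be used to estimate the uncertainty of the mean. The number of resamples (`n_boot`) and the seed are arbitrary choices, and `my_vector` is redefined so the block runs on its own:

```r
my_vector <- c(49, 31, 24, 35, 71, 7, 36, 23, 67, 37)

set.seed(42)     # arbitrary seed, for reproducibility
n_boot <- 1000   # arbitrary number of bootstrap resamples

# Resample with replacement and compute the mean, n_boot times
boot_means <- replicate(n_boot, mean(sample(my_vector, replace = TRUE)))

# The spread of the bootstrap means estimates the standard error of the mean
sd(boot_means)

# A simple 95% percentile interval for the mean
quantile(boot_means, probs = c(0.025, 0.975))
```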

“R is not a real programming language”: Debunking Myths I

R is Turing-complete:
- R can theoretically solve any computational problem. Foundational concept shared with e.g. Python, C++, Java, etc.
R is fully capable in Production:
- E.g. Shiny apps are used in industry, and R comes with an ecosystem supporting reproducibility in production settings.
Comprehensive Ecosystem:
- CRAN contains ~20,000 packages. Also Bioconductor is a gold standard for bioinformatics software.
Interoperability:
- Seamless integration with other languages (C, C++, Fortran, and Python) using packages like Rcpp and reticulate.

“R is not a real programming language”: Debunking Myths II

Advanced Programming Features:
- Supports object-oriented, functional, and imperative programming paradigms. Flexible metaprogramming with capabilities like non-standard evaluation
R’s Active & Growing Community:
- Annual global R conferences and numerous local user groups and also: tidyverse
Performance:
- R is interpreted and can be slower, but packages like data.table and Rcpp offer dramatic performance enhancements. Also, parallel computing is straightforward
Not Just for Statisticians:
- R’s applications range from web development to machine learning (tidymodels, caret, mlr3) to reporting (Quarto, bookdown)
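
To make "non-standard evaluation" a bit more concrete, here is a small base R sketch. The function name `show_expression()` is made up for illustration: it captures the code the caller typed, in addition to its value.

```r
# Hypothetical helper: capture the expression the caller wrote
show_expression <- function(x) {
  expr <- substitute(x)            # the unevaluated code, not its value
  paste(deparse(expr), "=", x)     # combine the code text with its value
}

a <- 2
b <- 3
show_expression(a + b)   # "a + b = 5"

# Code is data: build a call, inspect it, then evaluate it
my_call <- quote(mean(c(1, 2, 3)))
class(my_call)           # "call"
eval(my_call)            # 2
```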

Closing Thought: Every tool has its strengths. The key is to understand and leverage them effectively.

Tidyverse

With SO many packages, there will inevitably be SO many opinions
The tidyverse is a unified opinionated collection of R packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures
Today R has in essence become two dialects: base and tidyverse
Note: This course will focus solely on the tidyverse dialect
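
To give a first feel for the two dialects, here is the same small task written both ways. This is a sketch that assumes the dplyr package (part of the tidyverse) is installed and R is version 4.1 or later (for the `|>` pipe). It uses the built-in `iris` data set, which is chosen here only for illustration:

```r
library(dplyr)

# Base R dialect: mean sepal length per species
base_result <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Tidyverse dialect: the same computation as a pipeline of verbs
tidy_result <- iris |>
  group_by(Species) |>
  summarise(Sepal.Length = mean(Sepal.Length))

base_result
tidy_result
```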

We’ll spend a lot more time on going over the details of the Tidyverse!

Intermezzo: A brief course History

From ~20 to ~150 students
This year, the materials have been revised to suit large classroom teaching
The teaching team will do our best to support your learning, but it is important to emphasise that you will have to take responsibility for following the course curriculum!

General Course Outline

Tuesdays 08.00 - 12.00

08.00 - 08.30 Recap of key points from last week's exercises
08.30 - 08.45 Introduction to theme of the day
08.45 - 09.00 Break
09.00 - 12.00 Exercises

About

Course Description

Basically, what you can expect to learn and what I expect you to learn: DTU Course Base

Course Resources

Text Book: “R for Data Science 2e” by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund
Course site: https://r4bds.github.io

Course format

Active Learning: Very strong emphasis on students working in groups, rather than me talking
The focus is on you working actively, not me talking
I will not go through all preparation materials in class
Proper preparation is a prerequisite for completing lab exercises and maximising course yield
I focus on supporting your independence, hence for some exercises you will have to seek out information (I’m not a good data scientist, I’m just slightly better at googling than others)

Exercise feedback

Weekly

An exercise question will be highlighted
Each group is responsible for crafting an answer to this highlighted question
These answers will be hand-ins
The following week, we will choose a random answer to be discussed in plenum
Note: This starts from lab 2

Group Formation

Modern Bio Data Science is a team sport!
You have to form a group of 4-5 students with a Shared Bio Data Science / Bioinformatics Area of Interest
You will work in these groups throughout the course
You will do the final project in these groups
You will attend the exam in these groups
Group work is a very important meta skill for an engineer!
Please fill in groups, see schedule for lab 1
If you do not have a group, fill in your id and interest at an available group and someone might join you
I aim to let you decide on the groups, but I will of course be happy to help if needed

How to succeed in this course

Prepare materials as instructed!
Show up for class!
Do the exercises!
Do the project work!

Basically, show up, follow the curriculum and you will do fine!

Questions?