Introduction to Python¶

with Application to Bioinformatics¶

- Day 2¶

Review Day 1¶

Go to Canvas, Modules -> Day 2 -> Review Day 1

~30 minutes

Variables and Types¶

1. Which of the following is of type float?
7.4

2. Match the following variables with their type:

var1 = 54             integer
var2 = [1,2,7,1,24]     list
var3 = 2.98             float
var4 = True             boolean

Literals¶

All literals have a type:

  • Strings (str)       'Hello' "Hi"
  • Integers (int)     5
  • Floats (float)     3.14
  • Boolean (bool)     True or False
In [5]:
type(5.0)
Out[5]:
float

Variables¶

Used to store values and to assign them a name.

In [6]:
a = 3.14
a + 2
Out[6]:
5.140000000000001
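
The extra digits in the output above come from how floats are stored in binary, not from a mistake in the addition. A minimal sketch of two common ways to handle this, rounding for display and comparing with a tolerance (the variable names are only illustrative):

```python
import math

a = 3.14
total = a + 2

# Floats are stored in binary, so the sum is not exactly 5.14
print(total)

# Round the value when it is only needed for display
rounded = round(total, 2)
print(rounded)              # 5.14

# Compare floats with a tolerance instead of ==
is_equal = total == 5.14
is_close = math.isclose(total, 5.14)
print(is_equal, is_close)   # False True
```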

Lists¶

A collection of values.

In [8]:
x = [1,5,3,7,8]
y = ['a','b','c']
type(y)
z = [1, 2, 3, 'a', 'b']

Comments¶

3. Which of the following symbols can be used to write comments in your code?

#

Operations¶

4. What happens if you do [1,2,5,11] + [87,2,43,3]?
[1,2,5,11,87,2,43,3]     The lists will be concatenated


5. How do you find out if the variable x is present in the list mylist?
Two answers are correct:

  1. x in mylist
  2. for l in mylist:
       if l == x:
           print('Found a match')

6. How do you find out if 5 is larger than 3 and the integer 4 is the same as the float 4? Fill in all the missing code.
5 > 3 and 4 == 4.0

Basic operations¶

Type         Operations

int          + - / * ** % // ...
float        + - / * % // ...
string       +

In [11]:
a = 2
b = 5.46
c = [1,2,3,4]
d = [5,6,7,8]
e = 7
c * b
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [11], in <cell line: 6>()
      4 d = [5,6,7,8]
      5 e = 7
----> 6 c * b

TypeError: can't multiply sequence by non-int of type 'float'
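
The error above appears because a list can only be multiplied by an integer, which repeats the list. A small sketch of what does work, and of one way to multiply every element by a float (the names `repeated` and `scaled` are only illustrative):

```python
c = [1, 2, 3, 4]
b = 5.46

# Multiplying a list by an integer repeats the list
repeated = c * 2
print(repeated)   # [1, 2, 3, 4, 1, 2, 3, 4]

# To multiply every element by a float, loop over the list
scaled = []
for value in c:
    scaled.append(value * b)
print(scaled)
```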

Comparison/Logical/Membership operators¶


In [15]:
a = [1,2,3,4,5,6,7,8]
b = 5
c = 10
b in a
b < c or c == 1
b not in a 
Out[15]:
False
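
Only the result of the last expression is shown in the cell above. As a sketch, storing each result in its own (illustratively named) variable makes all three visible:

```python
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = 5
c = 10

is_member = b in a          # membership: is 5 an element of a?
either = b < c or c == 1    # logical or: True if at least one side is True
not_member = b not in a     # negated membership

print(is_member)    # True
print(either)       # True
print(not_member)   # False
```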

Sequences¶

7. How do you select the second element in the variable mylist = [4,3,8,10]?
mylist[1]

8. Pair the following variables with whether they are mutable or immutable

var1 = 'my pretty string'       immutable
var2 = [1,2,3,4,5]             mutable
var3 = "hello world"             immutable
var4 = ['a', 'b', 'c', 'd']     mutable

9. Which of the following types are iterable?
Lists and strings

Indexing¶

Lists (and strings) are an ORDERED collection of elements where every element can be accessed through an index.

a[0] : first item in list a

REMEMBER! Indexing starts at 0 in Python

In [17]:
a = [1,2,3,4,5]
b = ['a','b','c']
c = 'a random string'

c[2]
c[1:4]
Out[17]:
' ra'
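
A short sketch of indexing and slicing on the same values. A slice `[start:end]` includes `start` but excludes `end`, and negative indices count from the end (the variable names are only illustrative):

```python
a = [1, 2, 3, 4, 5]
c = 'a random string'

first = a[0]        # first element of the list
third_char = c[2]   # third character of the string
piece = c[1:4]      # characters at index 1, 2 and 3 (index 4 is excluded)
last = c[-1]        # negative indices count from the end
tail = c[-6:]       # the last six characters

print(first)        # 1
print(third_char)   # r
print(repr(piece))  # ' ra'
print(last)         # g
print(tail)         # string
```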

Mutable / Immutable sequences and iterables¶

Lists are mutable objects, meaning you can use an index to change the list, while strings are immutable and therefore cannot be changed.

An iterable sequence is anything you can loop over, e.g. lists and strings.

In [19]:
a = [1,2,3,4,5]         # mutable
b = ['a','b','c']       # mutable
c = 'a random string'   # immutable

c[0] = 'A'
#a[0] = 42
c
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [19], in <cell line: 5>()
      2 b = ['a','b','c']       # mutable
      3 c = 'a random string'   # immutable
----> 5 c[0] = 'A'
      6 #a[0] = 42
      7 c

TypeError: 'str' object does not support item assignment
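
Since a string cannot be changed in place, the usual workaround is to build a new string. A minimal sketch, reusing the values from the cell above (`new_c` and `replaced` are illustrative names):

```python
a = [1, 2, 3, 4, 5]
c = 'a random string'

# Lists are mutable: item assignment changes the list itself
a[0] = 42
print(a)          # [42, 2, 3, 4, 5]

# Strings are immutable: build a new string instead
new_c = 'A' + c[1:]
replaced = c.replace('a', 'A', 1)   # replace only the first 'a'
print(new_c)      # A random string
print(replaced)   # A random string

# The original string is unchanged
print(c)          # a random string
```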

New data type: tuples¶

  • A tuple is an immutable sequence of objects
  • Unlike a list, nothing can be changed in a tuple
  • Still iterable
In [24]:
myTuple = (1,2,3,4,'a','b','c')
myTuple[0] = 42
#print(myTuple)
print(len(myTuple))
#for i in myTuple:
#     print(i)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [24], in <cell line: 2>()
      1 myTuple = (1,2,3,4,'a','b','c')
----> 2 myTuple[0] = 42
      3 #print(myTuple)
      4 print(len(myTuple))

TypeError: 'tuple' object does not support item assignment
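
If the content of a tuple does need to change, one option is to convert it to a list, change the list, and convert back. This is only a sketch; the original tuple is never modified:

```python
my_tuple = (1, 2, 3, 4, 'a', 'b', 'c')

# Tuples support len() and looping, just like lists
print(len(my_tuple))   # 7

# Convert to a list to make changes, then back to a tuple
as_list = list(my_tuple)
as_list[0] = 42
new_tuple = tuple(as_list)

print(new_tuple)   # (42, 2, 3, 4, 'a', 'b', 'c')
print(my_tuple)    # (1, 2, 3, 4, 'a', 'b', 'c')
```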

If/ Else statements¶

10. How do you print 'Yes' if x is greater than y?
if x > y:
    print('Yes')

In [25]:
a = 3
b = [1,2,3,4]
if a in b:
    print(str(a)+' is found in the list b')
else:
    print(str(a)+' is not in the list')
3 is found in the list b

Files and loops¶

11. How do you open a file handle to read a file called 'somerandomfile.txt'?
fh = open('somerandomfile.txt')


12. The file in the previous question contains several lines. How do you print each line?

  1. for line in fh:
     print(line)

  2. for row in fh:
     print(row)

In [26]:
fh = open('../files/somerandomfile.txt','r', encoding = 'utf-8')
for line in fh:
    print(line.strip())
fh.close()
just a strange
file with
some
nonsense lines
In [27]:
numbers = [5,6,7,8]
i = 0
while i < len(numbers):
    print(numbers[i])
    i += 1
5
6
7
8
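
An alternative to calling `fh.close()` yourself is the `with` statement, which closes the file automatically when the block ends. The sketch below is not part of the course code; it first writes a temporary stand-in file (an assumption, so that the example runs anywhere) containing the four lines printed above:

```python
import os
import tempfile

# Create a small stand-in file so the example runs anywhere
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8')
tmp.write('just a strange\nfile with\nsome\nnonsense lines\n')
tmp.close()

lines = []
# The file is closed automatically when the with-block ends
with open(tmp.name, 'r', encoding='utf-8') as fh:
    for line in fh:
        lines.append(line.strip())

print(lines)
print(fh.closed)   # True

# Remove the stand-in file again
os.remove(tmp.name)
```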

Questions?¶

Day 2¶

  • Pseudocode
  • Functions vs Methods

How to approach a coding task¶

Problem:
You have a VCF file with a large number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1.000.000-1.005.000). What you want to know is whether this sample has any variants in this region, and if so, which variants.

Always write pseudocode!¶


Pseudocode is a description of what you want to do without actually using proper syntax

What is your input?¶

A VCF file that is iterable


Basic Pseudocode:¶

  • Open file and loop over lines (ignore lines with #)
  • Identify lines where chromosome is 5 and position is between 1.000.000 and 1.005.000
  • Isolate the column that contains the genotype for sample1
  • Extract the genotypes only from the column
  • Check if the genotype contains any alternate alleles
  • Print any variants containing alternate alleles for this sample within the specified region


- Open file and loop over lines (ignore lines starting with #)

In [1]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):  
        print(line.strip())
        break
fh.close()
# Next, find chromosome 5
1	10492	.	C	T	550.31	LOW_VQSLOD	AN=26;AC=2	GT:AD:DP:GQ:PGT:PID:PL	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	0/1:12,7:19:99:0|1:10403_ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC_A:196,0,340	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.	0/1:18,4:22:48:.:.:48,0,504	./.:0,0:0:.:.:.:.	./.:0,0:0:.:.:.:.

- Identify lines where chromosome is 5 and position is between 1.000.000 and 1.005.000


In [2]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5':
            print(cols)
            break
fh.close()

# Next, find the correct region
['5', '12041', '.', 'A', 'T', '18075.2', 'PASS', 'AN=26;AC=2', 'GT:AD:DP:GQ:PL', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', '0/1:15,6:21:99:142,0,391', './.:0,0:0:.:.', '0/1:16,17:33:99:442,0,422']


In [5]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5' and \
           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:
                print(cols)
                break
fh.close()
# Next, find the genotypes for sample1
['5', '1000080', '.', 'A', 'T', '2557.1', 'PASS', 'AN=26;AC=2', 'GT:AD:DP:GQ:PL', '0/1:15,18:33:99:489,0,357', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', '0/1:21,19:40:99:481,0,542', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.', './.:0,0:0:.:.']

- Isolate the column that contains the genotype for sample1


In [6]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5' and \
           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:
                geno = cols[9]
                print(geno)
                break
fh.close()
# Next, extract the genotypes only
0/1:15,18:33:99:489,0,357

- Extract the genotypes only from the column


In [7]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5' and \
           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:
                geno = cols[9].split(':')[0]
                print(geno)
                break
fh.close()
# Next, find in which positions sample1 has alternate alleles
0/1
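
The FORMAT column (`cols[8]`) names the sub-fields of every sample column, and it can differ between lines: the first line of the file printed earlier uses `GT:AD:DP:GQ:PGT:PID:PL`, while the chromosome 5 lines use `GT:AD:DP:GQ:PL`. When present, GT is the first sub-field, which is why `split(':')[0]` works. As a sketch, the position of any sub-field can also be looked up by name (the variable names are illustrative; the values are copied from the output above):

```python
format_col = 'GT:AD:DP:GQ:PL'               # cols[8] in the line above
sample_col = '0/1:15,18:33:99:489,0,357'    # cols[9] in the line above

keys = format_col.split(':')
values = sample_col.split(':')

# Look up each sub-field by its name instead of a fixed position
geno = values[keys.index('GT')]
depth = values[keys.index('DP')]

print(geno)    # 0/1
print(depth)   # 33
```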

- Check if the genotype contains any alternate alleles


In [9]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5' and \
           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:
                geno = cols[9].split(':')[0]
                if geno in ['0/1', '1/1']:
                    print(geno)
fh.close()
#Next, print nicely
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
0/1
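
The check `geno in ['0/1', '1/1']` covers the genotypes seen in this file. Other VCF files may also contain phased genotypes such as `1|0` or a second alternate allele such as `1/2`. A hedged sketch of a more general check follows; `has_alt_allele` is a hypothetical helper, not part of the course code:

```python
def has_alt_allele(geno):
    """Return True if any allele in a genotype string such as '0/1' is not the reference."""
    # Treat phased (|) and unphased (/) genotypes the same way
    alleles = geno.replace('|', '/').split('/')
    for allele in alleles:
        # '0' is the reference allele, '.' is a missing call
        if allele not in ['0', '.']:
            return True
    return False

print(has_alt_allele('0/1'))   # True
print(has_alt_allele('./.'))   # False
print(has_alt_allele('1|0'))   # True
```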

- Print any variants containing alternate alleles for this sample within the specified region


In [10]:
fh = open('/mnt/c/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')
res = []
for line in fh:
    if not line.startswith('#'):
        cols = line.strip().split('\t')
        if cols[0] == '5' and \
           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:
                geno = cols[9].split(':')[0]
                if geno in ['0/1', '1/1']:
                    var = cols[0]+':'+cols[1]+'_'+cols[3]+'-'+cols[4]
                 #   print(var+' has genotype: '+geno)
                    res.append(var)
fh.close()
print(res)
['5:1000080_A-T', '5:1000156_G-A', '5:1001097_C-A', '5:1001193_C-T', '5:1001245_T-C', '5:1001339_C-T', '5:1001344_G-C', '5:1001683_G-T', '5:1001755_G-A', '5:1002374_G-A', '5:1002382_G-C', '5:1002620_T-C', '5:1002722_G-A', '5:1002819_C-A', '5:1003043_G-T', '5:1003099_C-T', '5:1003135_G-A', '5:1004648_A-G', '5:1004650_A-C', '5:1004665_A-G', '5:1004702_G-T', '5:1004879_T-C']
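
The same steps can be collected in a function so that the chromosome and region are not hard-coded. This is only a sketch: `variants_in_region` and its parameters are hypothetical names, and apart from the first variant (copied from the output above and cut down to one sample) the lines in `toy_lines` are made up for illustration:

```python
def variants_in_region(lines, chrom, start, end, sample_col=9):
    """Collect variants where the sample in column sample_col carries an alternate allele."""
    res = []
    for line in lines:
        if line.startswith('#'):
            continue                      # skip header lines
        cols = line.strip().split('\t')
        pos = int(cols[1])
        if cols[0] == chrom and start <= pos <= end:
            geno = cols[sample_col].split(':')[0]
            if geno in ['0/1', '1/1']:
                res.append(cols[0] + ':' + cols[1] + '_' + cols[3] + '-' + cols[4])
    return res

# Toy input: any iterable of lines works, including an open file handle
toy_lines = [
    '##fileformat=VCFv4.2\n',
    '5\t1000080\t.\tA\tT\t2557.1\tPASS\tAN=26;AC=2\tGT:AD:DP:GQ:PL\t0/1:15,18:33:99:489,0,357\n',
    '5\t1000100\t.\tG\tC\t100.0\tPASS\tAN=26;AC=0\tGT:AD:DP:GQ:PL\t0/0:30,0:30:99:0,90,900\n',
    '5\t2000000\t.\tC\tT\t100.0\tPASS\tAN=26;AC=2\tGT:AD:DP:GQ:PL\t1/1:0,30:30:99:900,90,0\n',
]

result = variants_in_region(toy_lines, '5', 1000000, 1005000)
print(result)   # ['5:1000080_A-T']
```

The second toy line is skipped because its genotype is `0/0`, and the third because its position lies outside the region.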

→ Exercises Day 2

3 options:

  1. Green exercise
  2. Yellow exercise
  3. Red exercise
  4. ChatGPT exercise

Level of complexity increases with each exercise
New to programming: Do Green exercise and possibly Yellow exercise + ChatGPT
More experienced: Do Yellow exercise and/or Red exercise + ChatGPT

More useful functions and methods¶

What is the difference between a function and a method?

A method always belongs to an object of a specific class; a function does not have to. For example:

print('a string') and print(42) both work, even though one argument is a string and the other is an integer

'a string '.strip() works, but [1,2,3,4].strip() does not work. strip() is a method that only works on strings

What does it matter to me?

For now, you mostly need to be aware of the difference, and know the different syntaxes:

A function:
functionName()

A method:
<object>.methodName()

In [41]:
len([1,2,3])
len('a string')

'a string  '.strip()
[1,2,3].strip() 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [41], in <cell line: 5>()
      2 len('a string')
      4 'a string  '.strip()
----> 5 [1,2,3].strip()

AttributeError: 'list' object has no attribute 'strip'
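
A small sketch of the two syntaxes side by side. The built-in function `hasattr()` is used here only to show which objects own a `strip` method; it is not part of the course material:

```python
text = 'a string  '
numbers = [1, 2, 3]

# A function is called on its own and accepts many kinds of objects
print(len(text))      # 10
print(len(numbers))   # 3

# A method is called on an object, using a dot
stripped = text.strip()
numbers.append(4)
print(stripped)       # a string
print(numbers)        # [1, 2, 3, 4]

# strip() belongs to strings, not to lists
print(hasattr(text, 'strip'))      # True
print(hasattr(numbers, 'strip'))   # False
```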

Functions¶


Python Built-in functions


In [44]:
abs(-5)
Out[44]:
5


In [46]:
sum([1,2,35,23,88,4])
Out[46]:
153

From Python documentation¶



In [51]:
sum([1,2,3,4],10)
Out[51]:
20
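
The optional second argument of `sum()` is the value the summation starts from; it defaults to 0. A short sketch with illustrative variable names:

```python
values = [1, 2, 3, 4]

plain = sum(values)          # start defaults to 0
shifted = sum(values, 10)    # start counting from 10
empty = sum([])              # an empty list gives the start value
empty_start = sum([], 10)

print(plain)         # 10
print(shifted)       # 20
print(empty)         # 0
print(empty_start)   # 10
```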


In [53]:
b = round(3.234556, 2)
a = 'my string'
print(b)
3.23
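
The second argument of `round()` is the number of decimals to keep. Without it, `round()` returns an integer, and values exactly halfway between two integers are rounded to the nearest even one. A short sketch:

```python
b = round(3.234556, 2)    # keep two decimals
whole = round(3.234556)   # no second argument: returns an int
half_down = round(2.5)    # ties go to the nearest even number
half_up = round(3.5)

print(b)                    # 3.23
print(whole)                # 3
print(half_down, half_up)   # 2 4
```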

Methods¶

Useful operations on strings¶


In [58]:
'    spaciou   sWith5678.com'.strip('mo.c')
Out[58]:
'    spaciou   sWith5678'
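
Note that the argument to `strip()` is a set of characters, not a suffix: characters are removed from both ends until one is found that is not in the set. The leading spaces above are kept because a space is not in `'mo.c'`. A sketch of the pitfall and of one way to remove an exact ending (the example strings other than the first are made up):

```python
url = '    spaciou   sWith5678.com'
stripped = url.strip('mo.c')
print(repr(stripped))   # '    spaciou   sWith5678'

# Pitfall: the trailing 'c' of 'abc' is also in the set and gets removed
pitfall = 'abc.com'.strip('.com')
print(pitfall)          # ab

# To remove an exact ending, check for it and slice it off
name = 'abc.com'
if name.endswith('.com'):
    name = name[:-len('.com')]
print(name)             # abc
```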


In [59]:
'    spaciou   sWith5678.com\n'.lstrip()
Out[59]:
'spaciou   sWith5678.com\n'


In [60]:
'    spaciou   sWith5678.com\n'.rstrip()
Out[60]:
'    spaciou   sWith5678.com'


In [61]:
a = '  split a string into a list '
a.split(maxsplit=3)
Out[61]:
['split', 'a', 'string', 'into a list ']
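
Without a separator, `split()` splits on any run of whitespace and ignores leading and trailing whitespace. With an explicit separator, every single occurrence splits the string, which can produce empty strings. A short sketch (the strings other than `a` are made up):

```python
a = '  split a string into a list '

words = a.split()                 # split on any run of whitespace
limited = a.split(maxsplit=3)     # stop after three splits
columns = 'x\ty\tz'.split('\t')   # explicit separator, as for tab-separated files
spaces = '  a  b '.split(' ')     # every single space splits

print(words)     # ['split', 'a', 'string', 'into', 'a', 'list']
print(limited)   # ['split', 'a', 'string', 'into a list ']
print(columns)   # ['x', 'y', 'z']
print(spaces)    # ['', '', 'a', '', 'b', '']
```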


In [65]:
'|'.join('a string already')
' '.join(['a', 'b', 'c', 'd'])
#' '.join([1,2,3])
Out[65]:
'a b c d'
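
`join()` is called on the separator and needs an iterable of strings, which is why the commented-out line with `[1,2,3]` would raise a `TypeError`. A sketch of how the numbers can be converted first (the variable names are illustrative):

```python
letters = ['a', 'b', 'c', 'd']
spaced = ' '.join(letters)
piped = '|'.join('abc')        # a string is iterable, so its characters are joined

# Numbers must be converted to strings before joining
numbers = [1, 2, 3]
as_strings = []
for n in numbers:
    as_strings.append(str(n))
joined_numbers = ' '.join(as_strings)

print(spaced)           # a b c d
print(piped)            # a|b|c
print(joined_numbers)   # 1 2 3
```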


In [67]:
'long string'.startswith('ng', 2)
#'long string'.endswith('nt')
Out[67]:
True
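
The optional second argument of `startswith()` is the index where the comparison begins. Both `startswith()` and `endswith()` also accept a tuple of alternatives. A short sketch (the last string is a made-up header line):

```python
text = 'long string'

from_index_2 = text.startswith('ng', 2)   # compare from index 2, i.e. 'ng string'
from_start = text.startswith('ng')        # compare from the beginning
ends = text.endswith('ing')
either = '#CHROM\tPOS'.startswith(('#', '>'))   # True if any alternative matches

print(from_index_2)   # True
print(from_start)     # False
print(ends)           # True
print(either)         # True
```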


In [69]:
'LongRandomString'.lower()
'LongRandomString'.upper()
Out[69]:
'LONGRANDOMSTRING'

Useful operations on Mutable sequences¶



In [ ]:
a = [1,2,3,4,5,5,5,5]
a.append(6)
a.pop(2)
a.reverse()
a.remove(5)

b = (1,2,3,4)
c = [1,2,3,4]
c.append(5)
c
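
A sketch that follows the list `a` through each method call above. All of these methods change the list in place; `pop()` additionally returns the removed element:

```python
a = [1, 2, 3, 4, 5, 5, 5, 5]

a.append(6)           # add 6 at the end
print(a)              # [1, 2, 3, 4, 5, 5, 5, 5, 6]

popped = a.pop(2)     # remove and return the element at index 2
print(popped)         # 3
print(a)              # [1, 2, 4, 5, 5, 5, 5, 6]

a.reverse()           # reverse the order in place
print(a)              # [6, 5, 5, 5, 5, 4, 2, 1]

a.remove(5)           # remove only the first occurrence of 5
print(a)              # [6, 5, 5, 5, 4, 2, 1]
```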

Summary¶

  • Tuples are immutable sequences of objects
  • Always plan your approach before you start coding
  • A method always belongs to an object of a specific class; a function does not have to
  • The official Python documentation describes the syntax for all built-in functions and methods

→ Exercises Day 2

3 options:

  1. Green exercise
  2. Yellow exercise
  3. Red exercise
  4. ChatGPT exercise

Level of complexity increases with each exercise
New to programming: Do Green exercise and possibly Yellow exercise + ChatGPT
More experienced: Do Yellow exercise and/or Red exercise + ChatGPT

IMDb¶

Download the 250.imdb file from the course website

The format of this file is:

  • Line by line
  • Columns separated by the | character
  • Header starting with #


# Votes | Rating | Year | Runtime | URL | Genres | Title

Find the movie with the highest rating¶


Write step-by-step pseudocode

  • Open file
  • Initialize a variable to keep track of the highest rating and its movie. Start the rating at 0
  • Loop over all lines not starting with '#'
  • Strip and split each line into a list
  • Save the element containing the rating from the list into a variable
  • If the current rating is higher than the stored rating, replace the stored rating and movie
  • Close file
  • Print the stored rating and movie


In [ ]:
fh   = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
best = [0,'']           # here we save the rating and which movie
for line in fh:
    if not line.startswith('#'):
        cols   = line.strip().split('|')
        rating = float(cols[1].strip())
        if rating > best[0]:           # if the rating is higher than previous highest, update best
            best = [rating, cols[6].strip()]   # strip any spaces left around the title
fh.close()
print(best)
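
The same logic can be tried out without the downloaded file by passing a list of lines to a function. This is only a sketch: `highest_rated` is a hypothetical name, and the lines in `toy_lines` are made up, following the header format shown above with ` | ` as an assumed separator:

```python
def highest_rated(lines):
    """Return [rating, title] for the line with the highest rating."""
    best = [0, '']
    for line in lines:
        if not line.startswith('#'):
            cols = line.strip().split('|')
            rating = float(cols[1].strip())
            # Strictly greater: on a tie, the first movie found is kept
            if rating > best[0]:
                best = [rating, cols[6].strip()]
    return best

# Made-up lines in the format: Votes | Rating | Year | Runtime | URL | Genres | Title
toy_lines = [
    '# Votes | Rating | Year | Runtime | URL | Genres | Title\n',
    '100 | 8.1 | 2001 | 5400 | url1 | Drama | Movie A\n',
    '250 | 9.0 | 1999 | 6000 | url2 | Crime | Movie B\n',
    '300 | 7.5 | 2010 | 7200 | url3 | Comedy | Movie C\n',
]

best = highest_rated(toy_lines)
print(best)   # [9.0, 'Movie B']
```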