QTM 151 - Introduction to Statistical Computing II

Lecture 07 - Applications 01: Simulation

Danilo Freire

Emory University

Welcome back! Nice to see you all! 😊

Brief recap of last class 📚

Last class we covered:

None as a placeholder or missing value in lists
Replacing values in lists using indexing: list[0] = value
How to use list.append() and list.extend() to add elements to a list
Expanding lists with the * operator: list = [1, 2, 3] * 2
Concatenating lists with the + operator: list1 + list2
How to write a for loop to iterate over a list

for item in list:
    do something

Today’s plan

We will learn to:

Use random numbers and simulations to solve problems
Draw random samples from different distributions
Create subplots to compare different simulations
Write nested loops
(Time permitting/at home) Write for loops with if statements

Visualisation of random variables 🎲

Simulate different distributions

We can use random numbers to simulate different distributions
As we saw before, we can use the random module to generate random numbers
We combine the random numbers generator with a numpy function to simulate different distributions (as we did with the normal distribution)
np.random.distribution_name(parameters): normal, uniform, binomial, chi-square, etc.
More information about the function can be found in the numpy documentation

Consider a sample with \(n\) observations

\(X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}\)

First, we need to import the necessary packages, then we can simulate the random variables

Simulate different distributions

# Import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Set seed for reproducibility
np.random.seed(151)

# Sample size
n = 10000 

# Simulate random variables
vec_normal = np.random.normal(loc = 7, scale = 5, size = n)
vec_chisqr = np.random.chisquare(df = 1, size = n)
vec_unif   = np.random.uniform(low = -3,high = 5, size = n)

# Check their means
print("Normal mean: " + str(np.mean(vec_normal)))
print("Chi-square mean: " + str(np.mean(vec_chisqr)))
print("Uniform mean: " + str(np.mean(vec_unif)))

Normal mean: 7.0273290325979465
Chi-square mean: 1.0051975023328465
Uniform mean: 1.0556049464433184

Multiple plots in the same figure 📊

Create subplots

We can create multiple plots in the same figure using the plt.subplots() function
This function returns a figure and an array of axes
- fig, ax = plt.subplots(nrows, ncols)
You can use any name instead of fig or ax (e.g., my_figures, axs (preferred over axes to avoid confusion), list_subfig, etc)
nrows is the number of rows of subplots
ncols is the number of columns of subplots
You can then plot on each axis using the plot() function, as you would with a single plot

More information can be found in the matplotlib documentation

Create subplots

Let’s explain the code line-by-line

# Create a plot with 1 row, 2 columns
# You will create a list of subfigures "list_subfig"
# You can choose whichever name you like
# The option "figsize" indicates the (width,height) of the graph in inches
fig, list_subfig = plt.subplots(1, 2, figsize = (10, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")
plt.show()

Create subplots

Let’s see the output

Try it yourself!

Do a version with three plots in the same row
What happens if you remove the “plt.tight_layout()” command?
What happens if you change the “figsize”?
Appendix 01

Create sequences of numbers

We use list(range(start, stop, step)) to create a sequence of numbers
start is the first number in the sequence
stop is the last number in the sequence (not included)
step is the difference between each number in the sequence
If start is not provided, it defaults to 0
For example, list(range(1, 10, 2)) will create the list [1, 3, 5, 7, 9]
More information can be found in the Python documentation

This design choice is intentional and follows the principle of half-open intervals (includes the start value but excludes the stop value), which is common in computer science
Since Python’s index starts at 0, this design choice makes it easier to work with lists and arrays. For instance, list[0:3] will return the first three elements of the list
Here are some examples:

# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(10)))  

# [1, 3, 5, 7, 9]
print(list(range(1, 10, 2)))  

# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(list(range(10, 0, -1)))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 5, 7, 9]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

Try it yourself!

An easy one

Use list(range()) to create the following lists:
- Create a list with the numbers from 0 to 9
- Create a list with the odd numbers from 1 to 9
We will use the range() function in other ways soon
Appendix 02

Nested loops 🔄

Nested loops

We can nest loops inside each other to iterate over multiple dimensions 🤯
The inner loop runs to completion for each iteration of the outer loop
This is useful when we need to iterate over a matrix or a list of lists
The syntax is simple: just write a loop inside another loop
The inner loop is indented twice: once for the outer loop and once for the inner loop
For example:

for i in range(3):
    for j in range(2):
        print(f"i = {i}, j = {j}")

This will print:

i = 0, j = 0
i = 0, j = 1
i = 1, j = 0
i = 1, j = 1
i = 2, j = 0
i = 2, j = 1

Why does it print this way?

Using simulations to solve statistical problems 🎲

The central limit theorem

The central limit theorem (CLT) is a fundamental concept in statistics
It states that the distribution of the mean (or sum) of a large number of independent, identically distributed random variables approaches a normal distribution, regardless of the original distribution
This is true even if the original distribution is not normal

The larger the sample size, the closer the distribution of the sample mean is to a normal distribution
We can simulate that using random numbers and “prove” the CLT
Let’s simulate the CLT using the uniform distribution

The central limit theorem

# One way is to write this with repeated code chunks
# Each time will start the process of generating new data from scratch.

num_simulations = 2000

# Simulate with sample size one
sample_size = 1
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 1")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

The central limit theorem

# Simulate with sample size 10
sample_size = 10
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 10")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

The central limit theorem

# Simulate with sample size 50
sample_size = 50
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 50")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

Hmm… wait a minute!

A technical side note

Why do we have to use range(num_simulations) here and not just num_simulations? 🤔
Let’s see what happens if we use num_simulations instead (error!)
Can anyone explain the difference between the two? Appendix 03

# Simulate with sample size 100
sample_size = 100
vec_xbar = [None] * num_simulations
for iteration in num_simulations:
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 100")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

An error occurred while executing the following cell:
------------------
# Simulate with sample size 100
sample_size = 100
vec_xbar = [None] * num_simulations
for iteration in num_simulations:
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 100")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()
------------------


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 4
      2 sample_size = 100
      3 vec_xbar = [None] * num_simulations
----> 4 for iteration in num_simulations:
      5     vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
      6     vec_xbar[iteration] = vec_unif.mean()

TypeError: 'int' object is not iterable

Now back to the simulation

We can see that the distribution of the sample mean approaches a normal distribution as the sample size increases
But we can make the code more efficient by using nested loops
We can write a loop to simulate the sample mean for different numbers of simulations
The, we write another loop to simulate the sample means

Nested loops

num_simulations = 2000
sample_size_list = [1,10,50,100,200]

for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
            vec_xbar[iteration] = vec_unif.mean()
    plt.hist(vec_xbar)
    plt.title("Distribution of Xbar when n is " + str(sample_size))
    plt.ylabel("Frequency")
    plt.xlabel("Values of Xbar")
    plt.show()

Try it yourself!

Repeat the above simulation with a few changes
Use a Chi-square distribution with (df = 1) instead of a normal
Appendix 04

Try it yourself!

Last one! 😅

Write code that puts all the figures in the same row
Appendix 05

And that’s it for today! 🎉

Thank you for your attention! 🙏

Appendix 01

Three plots in the same row

fig, list_subfig = plt.subplots(1, 3, figsize = (15, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure

list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 01

What happens if you remove the “plt.tight_layout()” command?

fig, list_subfig = plt.subplots(1, 3, figsize = (15, 3))

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 01

What happens if you change the “figsize”?

fig, list_subfig = plt.subplots(1, 3, figsize = (10, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 02

Create a list with the numbers from 0 to 9

print(list(range(10)))

print(list(range(1, 10, 2)))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 5, 7, 9]

Back to exercise

Appendix 03

Why do we have to use `range(num_simulations)` here?

The reason why we have to use range(num_simulations) is that the for loop expects an iterable object
If we use num_simulations, which is an integer (2000), we will get a TypeError because an integer is not iterable
Python sees the variable as only the number 2000, not a range of numbers from 0 to 2000
Therefore, the for loop will not work as expected and we need to use range(num_simulations) to iterate over the numbers from 0 to 1999
Is it clear now? 🤓

Back to exercise

Appendix 04

Repeat the simulation with a Chi-square distribution with df = 1

num_simulations = 2000
sample_size_list = [1,10,50,100,200]

for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_chisqr  = np.random.chisquare(df = 1, size = sample_size)
            vec_xbar[iteration] = vec_chisqr.mean()
    plt.hist(vec_xbar)
    plt.title("Distribution of Xbar when n is " + str(sample_size))
    plt.ylabel("Frequency")
    plt.xlabel("Values of Xbar")
    plt.show()

Back to exercise

Appendix 05

Put all the figures in the same row

fig, list_subfig = plt.subplots(1, 5, figsize = (20, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# Start the loop
num_simulations = 2000
sample_size_list = [1,10,50,100,200]

index = 0
for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_chisqr  = np.random.chisquare(df = 1, size = sample_size)
            vec_xbar[iteration] = vec_chisqr.mean()
    list_subfig[index].hist(vec_xbar)
    list_subfig[index].set_title("Distribution of Xbar when n is " + str(sample_size))
    list_subfig[index].set_ylabel("Frequency")
    list_subfig[index].set_xlabel("Values of Xbar")
    index = index + 1
    plt.show()

Back to exercise

QTM 151 - Introduction to Statistical Computing II

Welcome back! Nice to see you all! 😊

Brief recap of last class 📚

Last class we covered:

Today’s plan

We will learn to:

Visualisation of random variables 🎲

Simulate different distributions

Simulate different distributions

Multiple plots in the same figure 📊

Create subplots

Create subplots

Let’s explain the code line-by-line

Create subplots

Let’s see the output

Try it yourself!

Create sequences of numbers

Try it yourself!

An easy one

Nested loops 🔄

Nested loops

Using simulations to solve statistical problems 🎲

The central limit theorem

The central limit theorem

The central limit theorem

The central limit theorem

Hmm… wait a minute!

A technical side note

Now back to the simulation

Nested loops

Try it yourself!

Try it yourself!

Last one! 😅

And that’s it for today! 🎉

Thank you for your attention! 🙏

Appendix 01

Three plots in the same row

Appendix 01

What happens if you remove the “plt.tight_layout()” command?

Appendix 01

What happens if you change the “figsize”?

Appendix 02

Create a list with the numbers from 0 to 9

Appendix 03

Why do we have to use range(num_simulations) here?

Appendix 04

Repeat the simulation with a Chi-square distribution with df = 1

Appendix 05

Put all the figures in the same row

Why do we have to use `range(num_simulations)` here?