QTM 151 - Introduction to Statistical Computing II

Lecture 07 - Applications 01: Simulation

Danilo Freire

Emory University

23 September, 2024

Welcome back! Nice to see you all! 😊

Brief recap of last class 📚

Last class we covered:

  • None as a placeholder or missing value in lists
  • Replacing values in lists using indexing: list[0] = value
  • How to use list.append() and list.extend() to add elements to a list
  • Expanding lists with the * operator: list = [1, 2, 3] * 2
  • Concatenating lists with the + operator: list1 + list2
  • How to write a for loop to iterate over a list
for item in list:
    do something

Today’s plan

We will learn to:

  • Use random numbers and simulations to solve problems
  • Draw random samples from different distributions
  • Create subplots to compare different simulations
  • Write nested loops
  • (Time permitting/at home) Write for loops with if statements

Visualisation of random variables 🎲

Simulate different distributions

  • We can use random numbers to simulate different distributions
  • As we saw before, we can use the random module to generate random numbers
  • We combine the random numbers generator with a numpy function to simulate different distributions (as we did with the normal distribution)
  • np.random.distribution_name(parameters): normal, uniform, binomial, chi-square, etc.
  • More information about the function can be found in the numpy documentation
  • Consider a sample with \(n\) observations

\(X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}\)

  • First, we need to import the necessary packages, then we can simulate the random variables

Simulate different distributions

# Import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Set seed for reproducibility
np.random.seed(151)

# Sample size
n = 10000 

# Simulate random variables
vec_normal = np.random.normal(loc = 7, scale = 5, size = n)
vec_chisqr = np.random.chisquare(df = 1, size = n)
vec_unif   = np.random.uniform(low = -3,high = 5, size = n)

# Check their means
print("Normal mean: " + str(np.mean(vec_normal)))
print("Chi-square mean: " + str(np.mean(vec_chisqr)))
print("Uniform mean: " + str(np.mean(vec_unif)))
Normal mean: 7.0273290325979465
Chi-square mean: 1.0051975023328465
Uniform mean: 1.0556049464433184

Multiple plots in the same figure 📊

Create subplots

  • We can create multiple plots in the same figure using the plt.subplots() function
  • This function returns a figure and an array of axes
    • fig, ax = plt.subplots(nrows, ncols)
  • You can use any name instead of fig or ax (e.g., my_figures, axs (preferred over axes to avoid confusion), list_subfig, etc)
  • nrows is the number of rows of subplots
  • ncols is the number of columns of subplots
  • You can then plot on each axis using the plot() function, as you would with a single plot

Create subplots

Let’s explain the code line-by-line

# Create a plot with 1 row, 2 columns
# You will create a list of subfigures "list_subfig"
# You can choose whichever name you like
# The option "figsize" indicates the (width,height) of the graph in inches
fig, list_subfig = plt.subplots(1, 2, figsize = (10, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")
plt.show()

Create subplots

Let’s see the output

Try it yourself!

  • Do a version with three plots in the same row
  • What happens if you remove the “plt.tight_layout()” command?
  • What happens if you change the “figsize”?
  • Appendix 01

Create sequences of numbers

  • We use list(range(start, stop, step)) to create a sequence of numbers
  • start is the first number in the sequence
  • stop is the last number in the sequence (not included)
  • step is the difference between each number in the sequence
  • If start is not provided, it defaults to 0
  • For example, list(range(1, 10, 2)) will create the list [1, 3, 5, 7, 9]
  • More information can be found in the Python documentation
  • This design choice is intentional and follows the principle of half-open intervals (includes the start value but excludes the stop value), which is common in computer science
  • Since Python’s index starts at 0, this design choice makes it easier to work with lists and arrays. For instance, list[0:3] will return the first three elements of the list
  • Here are some examples:
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(10)))  

# [1, 3, 5, 7, 9]
print(list(range(1, 10, 2)))  

# [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(list(range(10, 0, -1)))  
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 5, 7, 9]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

Try it yourself!

An easy one

  • Use list(range()) to create the following lists:
    • Create a list with the numbers from 0 to 9
    • Create a list with the odd numbers from 1 to 9
  • We will use the range() function in other ways soon
  • Appendix 02

Nested loops 🔄

Nested loops

  • We can nest loops inside each other to iterate over multiple dimensions 🤯
  • The inner loop runs to completion for each iteration of the outer loop
  • This is useful when we need to iterate over a matrix or a list of lists
  • The syntax is simple: just write a loop inside another loop
  • The inner loop is indented twice: once for the outer loop and once for the inner loop
  • For example:
for i in range(3):
    for j in range(2):
        print(f"i = {i}, j = {j}")
  • This will print:
i = 0, j = 0
i = 0, j = 1
i = 1, j = 0
i = 1, j = 1
i = 2, j = 0
i = 2, j = 1
  • Why does it print this way?

Using simulations to solve statistical problems 🎲

The central limit theorem

  • The central limit theorem (CLT) is a fundamental concept in statistics
  • It states that the distribution of the mean (or sum) of a large number of independent, identically distributed random variables approaches a normal distribution, regardless of the original distribution
  • This is true even if the original distribution is not normal

  • The larger the sample size, the closer the distribution of the sample mean is to a normal distribution
  • We can simulate that using random numbers and “prove” the CLT
  • Let’s simulate the CLT using the uniform distribution

The central limit theorem

# One way is to write this with repeated code chunks
# Each time will start the process of generating new data from scratch.

num_simulations = 2000

# Simulate with sample size one
sample_size = 1
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 1")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

The central limit theorem

# Simulate with sample size 10
sample_size = 10
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 10")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

The central limit theorem

# Simulate with sample size 50
sample_size = 50
vec_xbar = [None] * num_simulations
for iteration in range(num_simulations):
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 50")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()

Hmm… wait a minute!

A technical side note

  • Why do we have to use range(num_simulations) here and not just num_simulations? 🤔
  • Let’s see what happens if we use num_simulations instead (error!)
  • Can anyone explain the difference between the two? Appendix 03
# Simulate with sample size 100
sample_size = 100
vec_xbar = [None] * num_simulations
for iteration in num_simulations:
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 100")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()
An error occurred while executing the following cell:
------------------
# Simulate with sample size 100
sample_size = 100
vec_xbar = [None] * num_simulations
for iteration in num_simulations:
    vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
    vec_xbar[iteration] = vec_unif.mean()
plt.hist(vec_xbar)
plt.title("Distribution of Xbar with size 100")
plt.ylabel("Frequency")
plt.xlabel("Values of Xbar")
plt.show()
------------------


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 4
      2 sample_size = 100
      3 vec_xbar = [None] * num_simulations
----> 4 for iteration in num_simulations:
      5     vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
      6     vec_xbar[iteration] = vec_unif.mean()

TypeError: 'int' object is not iterable

Now back to the simulation

  • We can see that the distribution of the sample mean approaches a normal distribution as the sample size increases
  • But we can make the code more efficient by using nested loops
  • We can write a loop to simulate the sample mean for different numbers of simulations
  • The, we write another loop to simulate the sample means

Nested loops

num_simulations = 2000
sample_size_list = [1,10,50,100,200]

for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_unif  = np.random.uniform(low = -2, high=2, size = sample_size)
            vec_xbar[iteration] = vec_unif.mean()
    plt.hist(vec_xbar)
    plt.title("Distribution of Xbar when n is " + str(sample_size))
    plt.ylabel("Frequency")
    plt.xlabel("Values of Xbar")
    plt.show()

Try it yourself!

  • Repeat the above simulation with a few changes
  • Use a Chi-square distribution with (df = 1) instead of a normal
  • Appendix 04

Try it yourself!

Last one! 😅

  • Write code that puts all the figures in the same row
  • Appendix 05

And that’s it for today! 🎉

Thank you for your attention! 🙏

Appendix 01

Three plots in the same row

fig, list_subfig = plt.subplots(1, 3, figsize = (15, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure

list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 01

What happens if you remove the “plt.tight_layout()” command?

fig, list_subfig = plt.subplots(1, 3, figsize = (15, 3))

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 01

What happens if you change the “figsize”?

fig, list_subfig = plt.subplots(1, 3, figsize = (10, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# First Figure
list_subfig[0].hist(x = vec_normal)
list_subfig[0].set_title("Normal Distribution")
list_subfig[0].set_xlabel("Value")
list_subfig[0].set_ylabel("Frequency")

# Second Figure
list_subfig[1].hist(x = vec_unif)
list_subfig[1].set_title("Uniform Distribution")
list_subfig[1].set_xlabel("Value")
list_subfig[1].set_ylabel("Frequency")

# Third Figure
list_subfig[2].hist(x = vec_chisqr)
list_subfig[2].set_title("Chi-square Distribution")
list_subfig[2].set_xlabel("Value")
list_subfig[2].set_ylabel("Frequency")
plt.show()

Back to exercise

Appendix 02

Create a list with the numbers from 0 to 9

print(list(range(10)))

print(list(range(1, 10, 2)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 5, 7, 9]

Back to exercise

Appendix 03

Why do we have to use range(num_simulations) here?

  • The reason why we have to use range(num_simulations) is that the for loop expects an iterable object
  • If we use num_simulations, which is an integer (2000), we will get a TypeError because an integer is not iterable
  • Python sees the variable as only the number 2000, not a range of numbers from 0 to 2000
  • Therefore, the for loop will not work as expected and we need to use range(num_simulations) to iterate over the numbers from 0 to 1999
  • Is it clear now? 🤓

Back to exercise

Appendix 04

Repeat the simulation with a Chi-square distribution with df = 1

num_simulations = 2000
sample_size_list = [1,10,50,100,200]

for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_chisqr  = np.random.chisquare(df = 1, size = sample_size)
            vec_xbar[iteration] = vec_chisqr.mean()
    plt.hist(vec_xbar)
    plt.title("Distribution of Xbar when n is " + str(sample_size))
    plt.ylabel("Frequency")
    plt.xlabel("Values of Xbar")
    plt.show()

Back to exercise

Appendix 05

Put all the figures in the same row

fig, list_subfig = plt.subplots(1, 5, figsize = (20, 3))

# The tight layout option ensures that the axes are not overlapping
plt.tight_layout()

# Adjust the spacing between subplots
plt.subplots_adjust(wspace=0.3)  # Adjust the value as needed

# Start the loop
num_simulations = 2000
sample_size_list = [1,10,50,100,200]

index = 0
for sample_size in sample_size_list:

    # The following command a vector null values, of length "num_simulations"
    vec_xbar = [None] * num_simulations
    
    for iteration in range(num_simulations):
            vec_chisqr  = np.random.chisquare(df = 1, size = sample_size)
            vec_xbar[iteration] = vec_chisqr.mean()
    list_subfig[index].hist(vec_xbar)
    list_subfig[index].set_title("Distribution of Xbar when n is " + str(sample_size))
    list_subfig[index].set_ylabel("Frequency")
    list_subfig[index].set_xlabel("Values of Xbar")
    index = index + 1
    plt.show()

Back to exercise