QTM 151 - Introduction to Statistical Computing II

Jupyter Notebooks, Packages, Variables, and Lists

Danilo Freire

Emory University

Nice to see you all again! 😊

Today’s agenda

Installing packages and working with variables and lists

  • Python is a versatile programming language, but it doesn’t come with all the tools we need
  • Packages are collections of functions that extend Python’s capabilities
  • There are thousands of packages available, and we can install them using conda install or pip install
  • We will also learn about variables and lists
  • Variables are containers that store data values
  • Lists are collections of items that can be of different types
  • Today, we will learn how to create, access, and modify variables and lists

Python environments

What is a Python environment?

  • A Python environment is a self-contained directory that contains a specific Python interpreter and a collection of packages
  • It allows you to manage dependencies and avoid conflicts between different projects
  • You can create multiple environments with different versions of Python and packages
  • Each environment is isolated from others, ensuring that changes in one environment do not affect others
  • You can create, activate, and deactivate environments using the command line or Anaconda Navigator
  • The default environment is called base, and it is created when you install Anaconda

Creating and managing environments with the command line

  • For this course, we will use the base environment
  • But you are free to create your own environments if you want!
  • There are two ways to create environments:
    • Using the command line: conda create -n qtm151 python=3.12
    • Using Anaconda Navigator: Go to the “Environments” tab and click on “Create”
  • After creating an environment in the command line, you can activate it using the command line: conda activate qtm151
  • To remove a package from an environment, use the command line: conda remove package_name
  • To deactivate an environment, use the command line: conda deactivate
  • To remove an environment, use the command line: conda remove -n qtm151 --all
# Create a new environment called qtm151
conda create -n qtm151 python=3.12

# Activate the qtm151 environment
conda activate qtm151

# Install the required packages
conda install numpy pandas matplotlib jupyter

# Deactivate the current environment
conda deactivate

# Remove the qtm151 environment
conda remove -n qtm151 --all
conda update conda
conda init

Python packages 📦

Installing packages

  • There are several ways to install packages in Python
  • The two most common ways are pip and conda
  • pip is the Python package installer, which comes pre-installed with Python
  • conda is the package manager that comes with Anaconda, and it is even more user-friendly
  • We will use conda to install packages in this course
  • You can install packages using the command conda install package in the terminal or go to the Anaconda Navigator and install them from there
  • In Anaconda Navigator, you can search for packages in the “Environments” tab
  • The main packages we will use are:
    • numpy: for numerical computing
    • pandas: for data manipulation
    • matplotlib: for data visualisation
  • You should have them installed already, as they come with the base Anaconda installation
  • If not, please try to install them using the Anaconda Navigator

Tip

Setting up your workspace: VSCode & Jupyter

  • We will use Visual Studio Code (VSCode) as our code editor

  • VSCode is a lightweight, open-source code editor that supports many programming languages, including Python

  • It has excellent support for Jupyter Notebooks, making it a great choice for data science projects

  • Ensure VS Code Uses Your Anaconda Python (Kernel Selection):

    • When you open or create a .ipynb file in VS Code, it needs to know which Python installation (kernel) to use.
    • Look for the kernel indicator (usually top-right). If it’s not set or incorrect (e.g., not pointing to your Anaconda Python), click it.
    • Choose “Python Environments” and select the Python version associated with your Anaconda installation.

VS Code Python Kernel Selection

Creating a new Jupyter Notebook in VS Code

Option 1: Via File Menu - Go to “File” > “New File…”. - Select “Jupyter Notebook” from the options. - Save the new file with a .ipynb extension (e.g., lecture_notes.ipynb).

VS Code New File Menu

Option 2: Via Command Palette - Press Cmd + Shift + P (macOS) or Ctrl + Shift + P (Windows/Linux). - Type “Create: New Jupyter Notebook” and select it. - Again, ensure the correct kernel is selected for this new notebook.

VS Code Command Palette for New Jupyter Notebook

Let’s start coding: importing packages

Loading Packages: the import statement

  • Before you can use functions from an installed package in your notebook, you must import it into your current session
  • It’s a strong convention to use common aliases (nicknames) for widely used packages using the as keyword. This improves code readability and conciseness
    • import pandas as pd
    • import matplotlib.pyplot as plt (pyplot is the main plotting module from Matplotlib)
    • import numpy as np
  • These nicknames are widely recognised in the data science community, so using them makes your code more understandable to others
  • You can also import specific functions from a package using the from keyword
    • from pandas import DataFrame
    • from matplotlib.pyplot import plot
  • This is useful if you only need a few functions from a package and want to avoid loading the entire package

Loading packages: in practice

  • Let’s import the “holy trinity” of data science packages in Python:
# "matplotlib.pyplot" is the primary interface for plotting, aliased as 'plt'
import matplotlib.pyplot as plt

# "pandas" is used for data manipulation and analysis, aliased as 'pd'
import pandas as pd

# "numpy" is for numerical operations, especially with arrays, aliased as 'np'
import numpy as np

print("Packages (matplotlib.pyplot, pandas, numpy) loaded successfully!")
Packages (matplotlib.pyplot, pandas, numpy) loaded successfully!

Opening datasets with pandas: read_csv()

  • Pandas is excellent for working with tabular data
  • The pd.read_csv() function is used to load data from a Comma Separated Values (CSV) file into a pandas DataFrame
  • Windows users, if you encounter issues with your file path, try using double backslashes \\ or a raw string r'path' to avoid escape character issues (e.g., 'C:\\path\\to\\file.csv' or r'C:\path\to\file.csv')
# The pd.read_csv() function reads a CSV file.
# The result (a DataFrame) is stored in the variable 'carfeatures'.
# Ensure 'data/features.csv' is in the correct path relative to your notebook.
carfeatures = pd.read_csv('data/features.csv')

carfeatures.head()  # Display the first few rows of the DataFrame
mpg cylinders displacement horsepower weight acceleration vehicle id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824

Viewing Your DataFrame in VS Code

VS Code offers several ways to inspect your DataFrame:

  • Jupyter Variables Tab:
    • Look for a “Variables” icon or section in the Jupyter interface within VS Code
    • Clicking it shows active variables; double-clicking a DataFrame (like carfeatures) opens it in a new tab for viewing Jupyter Toolbar with Variables button DataFrame viewed in VS Code tab
  • Data Wrangler Extension (if installed):
    • This extension provides more advanced tools for data viewing and cleaning
    • You might find a “View Data” button or a right-click option on the DataFrame variable Data Wrangler button Data Wrangler interface

Running basic analyses

Displaying the carfeatures dataframe

  • Typing the name of a DataFrame in a code cell and running it will display its contents
  • For large DataFrames, Jupyter typically shows a summary (the first and last few rows).
carfeatures
mpg cylinders displacement horsepower weight acceleration vehicle id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824
... ... ... ... ... ... ... ...
393 27.0 4 140 86 2790 15.6 F-1694103
394 44.0 4 97 52 2130 24.6 V-1694114
395 32.0 4 135 84 2295 11.6 D-1694125
396 28.0 4 120 79 2625 18.6 F-1694136
397 31.0 4 119 82 2720 19.4 C-1694147

398 rows × 7 columns

Displaying the carfeatures dataframe

  • You can also use the .head() method to display the first few rows of a DataFrame
# Display the first 5 rows of the DataFrame
carfeatures.head()  # Default is 5 rows
mpg cylinders displacement horsepower weight acceleration vehicle id
0 18.0 8 307 130 3504 12.0 C-1689780
1 15.0 8 350 165 3693 11.5 B-1689791
2 18.0 8 318 150 3436 11.0 P-1689802
3 16.0 8 304 150 3433 12.0 A-1689813
4 17.0 8 302 140 3449 10.5 F-1689824
  • To view the last few rows, use .tail()
# Display the last 5 rows of the DataFrame
carfeatures.tail()  # Default is 5 rows
mpg cylinders displacement horsepower weight acceleration vehicle id
393 27.0 4 140 86 2790 15.6 F-1694103
394 44.0 4 97 52 2130 24.6 V-1694114
395 32.0 4 135 84 2295 11.6 D-1694125
396 28.0 4 120 79 2625 18.6 F-1694136
397 31.0 4 119 82 2720 19.4 C-1694147

Selecting and displaying a single column

  • To select a single column from a DataFrame, use the column name in square brackets []
  • This returns a pandas Series (a one-dimensional array-like object)
# Extracting the 'cylinders' column.
carfeatures['cylinders'].head()
0    8
1    8
2    8
3    8
4    8
Name: cylinders, dtype: int64
  • You can also use the dot notation (if the column name doesn’t contain spaces or special characters)
# Using dot notation
carfeatures.cylinders.head()
0    8
1    8
2    8
3    8
4    8
Name: cylinders, dtype: int64
  • Note: Dot notation is less flexible than bracket notation, especially for column names with spaces or special characters

Example: Computing a frequency table

  • To count occurrences of unique values in a column, use pd.crosstab()
  • This function creates a cross-tabulation of two (or more) variables
  • The first argument is the column to be counted, and the second is a custom title for the resulting table
# crosstab counts how many rows fall into categories
# "index" is the category
# "columns" is a custom title

table = pd.crosstab(index = carfeatures['cylinders'], columns = "count")
table
col_0 count
cylinders
3 4
4 204
5 3
6 84
8 103

Example: Computing a frequency table

  • The column name in the crosstab table is set to col_0 by default
table.columns.name
'col_0'
  • We can rename the column too
table.columns.name = 'column name'
table
column name count
cylinders
3 4
4 204
5 3
6 84
8 103

Example: Generating basic summary statistics

  • The .describe() method on a DataFrame provides key descriptive statistics for all numerical columns
# .describe() computes: count, mean, standard deviation, min, 25th/50th/75th percentiles, and max
# It automatically ignores non-numeric columns
carfeatures.describe()
mpg cylinders displacement weight acceleration
count 398.000000 398.000000 398.000000 398.000000 398.000000
mean 23.514573 5.454774 193.427136 2970.424623 15.568090
std 7.815984 1.701004 104.268683 846.841774 2.757689
min 9.000000 3.000000 68.000000 1613.000000 8.000000
25% 17.500000 4.000000 104.250000 2223.750000 13.825000
50% 23.000000 4.000000 148.500000 2803.500000 15.500000
75% 29.000000 8.000000 262.000000 3608.000000 17.175000
max 46.600000 8.000000 455.000000 5140.000000 24.800000

Note

The horsepower column might be missing from .describe() if it was read as non-numeric (e.g., due to placeholder characters like ‘?’). Data cleaning would be needed to convert it.

Python variables & data types 📊

Variables: Named containers for data

  • Variables are used to store data values.
  • Python is dynamically typed, meaning you don’t need to declare the type of a variable.
  • Common Data Types:
    • Integers (int): Whole numbers (e.g., x = 10).
    • Floats (float): Numbers with decimals (e.g., pi = 3.14).
    • Strings (str): Text, enclosed in single ' ' or double " " quotes (e.g., message = "Hello").
    • Booleans (bool): Logical values, either True or False.
  • Use type() to check a variable’s data type.
  • Use print() to display a variable’s value.
age = 30                            # Integer
price = 19.99                       # Float
student_name = "Alice"              # String
is_enrolled = True                  # Boolean

print(f"Value: {age}, Type: {type(age)}")
print(f"Value: {price}, Type: {type(price)}")
print(f"Value: {student_name}, Type: {type(student_name)}")
print(f"Value: {is_enrolled}, Type: {type(is_enrolled)}")
Value: 30, Type: <class 'int'>
Value: 19.99, Type: <class 'float'>
Value: Alice, Type: <class 'str'>
Value: True, Type: <class 'bool'>

Storing variables in memory & naming conventions

  • Assignment: Use the = operator (e.g., user_count = 150).
  • Naming Rules & Best Practices:
    • Must start with a letter or underscore (_).
    • Can contain letters, numbers, and underscores.
    • Case-sensitive (myVariable is different from myvariable).
    • Cannot be a Python keyword (e.g., if, for, class).
    • Use descriptive names (e.g., average_score instead of avg).
    • Convention: snake_case (lowercase with underscores) for variables and functions.
  • View active variables in VS Code Jupyter using the “Variables” panel or Data Wrangler.
count_apples = 5
item_price = 2.50
user_greeting = "Welcome back!"

print(count_apples)
print(user_greeting)
5
Welcome back!
  • Try it yourself!
  • Exercise: Create a variable to store the title of your favourite film and print it!

Basic operations with variables

Numeric Operations:

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: / (results in a float)
  • Floor Division: // (discards decimal)
  • Modulus: % (remainder)
  • Exponentiation: ** (e.g., 2**3 is 8)
  • Use parentheses () to control the order of operations.
total_cost = 10 * 1.08  # Price * tax
print(f"Total cost: {total_cost}")

base_number = 4
calculation = (base_number + 6) / 2
print(f"Calculation result: {calculation}")
Total cost: 10.8
Calculation result: 5.0

String Concatenation:

  • Use the + operator to join strings.
first_name = "Danilo"
last_name = "Freire"
full_name = first_name + " " + last_name
print(full_name)
Danilo Freire
  • Try it yourself!
  • Exercise: Define variables for your name and major, then print a concatenated string introducing yourself

Python lists 📝

Lists: Ordered, mutable collections

  • Lists are versatile and widely used to store multiple items in a single variable.
  • Characteristics:
    • Created using square brackets [], items separated by commas.
    • Items can be of different data types (e.g., numbers, strings, even other lists).
    • Ordered: Items maintain their position/sequence.
    • Mutable: You can change, add, or remove items after the list is created.
  • Access items using their index (position).
  • Very important: Python indexing starts at 0! The first item is at index 0.

Lists: Ordered, mutable collections

# List of integers
prime_numbers = [2, 3, 5, 7, 11]

# List of strings (e.g., survey responses)
fav_colours = ["blue", "green", "blue", "red", "yellow"]
print(f"Prime Numbers: {prime_numbers}, Type: {type(prime_numbers)}")
print(f"Favourite Colours: {fav_colours}")

# List with mixed data types
mixed_data_list = ["Python", 3.11, 42, True]

# Lists can be nested (a list containing other lists)
nested_list_example = [[1, 2], ["a", "b"], mixed_data_list]
nested_list_example
Prime Numbers: [2, 3, 5, 7, 11], Type: <class 'list'>
Favourite Colours: ['blue', 'green', 'blue', 'red', 'yellow']
[[1, 2], ['a', 'b'], ['Python', 3.11, 42, True]]

Accessing list elements: Indexing

  • Use square brackets [] with the index number after the list’s name.
    • my_list[0] gets the first element.
    • my_list[1] gets the second element.
  • Negative Indexing:
    • my_list[-1] gets the last element.
    • my_list[-2] gets the second-to-last element.
  • For nested lists, use multiple index brackets: nested_list[outer_index][inner_index].
# Using prime_numbers = [2, 3, 5, 7, 11]
print(f"First prime: {prime_numbers[0]}")      # Output: 2
print(f"Third prime: {prime_numbers[2]}")      # Output: 5
print(f"Last prime: {prime_numbers[-1]}")     # Output: 11
First prime: 2
Third prime: 5
Last prime: 11

Accessing list elements: Indexing

  • Using nested_list_example = [[[1, 2], ["a", "b"], ["Python", 3.11, 42, True]]]
  • Let’s access Python from the nested list:
  • The inner list ["Python", 3.11, 42, True] is at nested_list_example[2]
  • Python is the first element (index 0) of that inner list
nested_list_example[2][0]
'Python'
  • Try it yourself!
    • Create a list containing the titles of your three favourite films
    • Print the last film using both positive and negative indexing

Visualising data with matplotlib 📊

Visualising list data: histograms

  • matplotlib.pyplot (imported as plt) is Python’s primary plotting library
  • plt.hist() creates a histogram. This is useful for visualising the distribution of a single set of numerical data or the frequency of categorical items in a list
  • plt.show() displays the generated plot
  • Always add labels and a title for clarity:
    • plt.title("My Plot Title")
    • plt.xlabel("X-axis Label")
    • plt.ylabel("Y-axis Label")
# import matplotlib.pyplot as plt
# fav_colours = ["blue", "green", "blue", "red", "yellow"]
colour_survey = fav_colours + ['blue', 'green', 'green']

plt.hist(x=colour_survey, color='skyblue')
plt.title("Frequency of Favourite Colours in Survey")
plt.xlabel("Colour")
plt.ylabel("Frequency Count")
plt.show()

Try it yourself!

  • Create a list with repeated string values (e.g., favourite books) and generate a histogram
  • Ensure you add a title, and labels for the x and y axes!

Visualising relationships: scatter plots with lists

  • plt.scatter() is used to create a scatter plot.
  • It requires two lists of equal length: one for the x-coordinates and one for the y-coordinates.
  • Scatter plots are excellent for visualising the relationship (or lack thereof) between two numerical variables.
x_values_list = [1, 2, 3, 4, 5]    
y_values_list = [1, 4, 9, 16, 25] 
plt.scatter(x=x_values_list, y=y_values_list, color='purple')
plt.xlabel("X Value (Input Number)")
plt.ylabel("Y Value (Squared Number)")
plt.title("Scatter Plot: Numbers vs. Their Squares (from Lists)")
plt.grid(True)
plt.show()

Scatter plots directly from dataframe columns

  • You can create scatter plots using columns from your pandas DataFrames
  • The x and y parameters can be set to the names of the columns you want to plot
  • Ensure the DataFrame is not empty and contains the specified columns!
# Using the 'carfeatures' DataFrame
plt.scatter(x=carfeatures['weight'], y=carfeatures['mpg'], alpha=0.7, color='green')
plt.xlabel("Weight of the Car (lbs)")
plt.ylabel("Miles Per Gallon (MPG)")
plt.title("Scatter Plot: Car Weight vs. MPG")
plt.grid(True)
plt.show()

Suggestion: Try another scatter plot using different columns, e.g., x=carfeatures['acceleration'] vs y=carfeatures['mpg'].

Try it yourself!

  • Create two lists of numbers representing, for example, hours studied and exam scores.
  • Generate your own scatter plot from these lists.
  • Ensure you add a title, and labels for the x and y axes.

And that’s all for today! 🎉

Any Questions? 🤔

See You Next Time! 🚀

Appendix - Solutions

  • Create a variable with your favourite film
my_favourite_film = "The Godfather"
print(my_favourite_film) 
The Godfather
  • Define variables for your name and major, then print an introduction
user_name = "Charlie" # Variable for name
user_major = "Quantitative Sciences" # Variable for major

# Concatenating strings to form an introduction
introduction_message = "My name is " + user_name + " and I am majoring in " + user_major + "."
print(introduction_message)

# Alternative using an f-string (formatted string literal - often preferred for readability)
print(f"My name is {user_name} and I am majoring in {user_major}.")
My name is Charlie and I am majoring in Quantitative Sciences.
My name is Charlie and I am majoring in Quantitative Sciences.

Appendix - Solutions

  • Create a list of your three favourite films and print the last one
favourite_films_list = ["Pulp Fiction", "Blade Runner 2049", "The Godfather"]

# Print the last film using positive indexing
# Length of the list is 3, so the index of the last element is 2 (3 - 1)
print(favourite_films_list[2])

# Print the last film using negative indexing (-1 always refers to the last element)
print(favourite_films_list[-1])
The Godfather
The Godfather

Appendix - Solutions

  • Create a list of your favourite books and generate a histogram
# A list of favourite books, with some titles repeated
favourite_books_survey = [
    "The Illiad", "1984", "Brave New World", "The Illiad",
    "The Odyssey", "1984", "The Odyssey", "The Illiad"
]

plt.hist(x=favourite_books_survey, color='teal', edgecolor='black') # rwidth adds spacing
plt.title("Frequency of Favourite Books Mentioned")
plt.xlabel("Book Title")
plt.ylabel("Number of Mentions")
plt.show()

Appendix - Solutions

  • Create two lists of numbers (e.g., hours studied & exam scores) and make a scatter plot
# Example data: hours studied vs. exam scores
hours_studied_data = [1, 2, 2.5, 3, 4, 4.5, 5, 6, 7, 8]
exam_scores_data =   [60, 65, 68, 70, 78, 80, 85, 88, 92, 95]
plt.scatter(x=hours_studied_data, y=exam_scores_data, color='crimson', marker='^') # '^' for triangle markers
plt.xlabel("Hours Studied This Week")
plt.ylabel("Exam Score (out of 100)")
plt.title("Relationship: Study Hours vs. Exam Scores")
plt.xlim(0, 9) # Set x-axis limits
plt.ylim(50, 100) # Set y-axis limits
plt.grid(True, linestyle='--', alpha=0.7) # Add a styled grid
plt.show()