Installing packages and working with variables and lists
Python is a versatile programming language, but it doesn’t come with all the tools we need
Packages are collections of functions that extend Python’s capabilities
There are thousands of packages available, and we can install them using conda install or pip install
We will also learn about variables and lists
Variables are containers that store data values
Lists are collections of items that can be of different types
Today, we will learn how to create, access, and modify variables and lists
Python environments
What is a Python environment?
A Python environment is a self-contained directory that contains a specific Python interpreter and a collection of packages
It allows you to manage dependencies and avoid conflicts between different projects
You can create multiple environments with different versions of Python and packages
Each environment is isolated from others, ensuring that changes in one environment do not affect others
You can create, activate, and deactivate environments using the command line or Anaconda Navigator
The default environment is called base, and it is created when you install Anaconda
Creating and managing environments with the command line
For this course, we will use the base environment
But you are free to create your own environments if you want!
There are two ways to create environments:
Using the command line: conda create -n qtm151 python=3.12
Using Anaconda Navigator: Go to the “Environments” tab and click on “Create”
After creating an environment, activate it from the command line: conda activate qtm151
To remove a package from an environment, use the command line: conda remove package_name
To deactivate an environment, use the command line: conda deactivate
To remove an environment, use the command line: conda remove -n qtm151 --all
# Create a new environment called qtm151
conda create -n qtm151 python=3.12
# Activate the qtm151 environment
conda activate qtm151
# Install the required packages
conda install numpy pandas matplotlib jupyter
# Deactivate the current environment
conda deactivate
# Remove the qtm151 environment
conda remove -n qtm151 --all
We will use Visual Studio Code (VS Code) as our code editor
VS Code is a lightweight, open-source code editor that supports many programming languages, including Python
It has excellent support for Jupyter Notebooks, making it a great choice for data science projects
Ensure VS Code Uses Your Anaconda Python (Kernel Selection):
When you open or create a .ipynb file in VS Code, it needs to know which Python installation (kernel) to use.
Look for the kernel indicator (usually top-right). If it’s not set or incorrect (e.g., not pointing to your Anaconda Python), click it.
Choose “Python Environments” and select the Python version associated with your Anaconda installation.
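To confirm which Python installation a notebook is actually using, a quick check can be run in the first cell. This is a minimal sketch that uses only the standard library; the paths it prints depend on your own installation, so no particular output is assumed.

```python
import sys

# Path to the Python interpreter running this notebook or script.
# With an Anaconda kernel, this path normally sits inside the Anaconda folder.
print("Interpreter:", sys.executable)

# Root directory of the active environment (base or one you created).
print("Environment:", sys.prefix)

# Version of Python provided by that environment.
version = f"{sys.version_info.major}.{sys.version_info.minor}"
print("Python version:", version)
```

If the interpreter path does not point to your Anaconda folder, pick a different kernel using the kernel indicator.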
Creating a new Jupyter Notebook in VS Code
Option 1: Via File Menu
- Go to “File” > “New File…”.
- Select “Jupyter Notebook” from the options.
- Save the new file with a .ipynb extension (e.g., lecture_notes.ipynb).
Option 2: Via Command Palette
- Press Cmd + Shift + P (macOS) or Ctrl + Shift + P (Windows/Linux).
- Type “Create: New Jupyter Notebook” and select it.
- Again, ensure the correct kernel is selected for this new notebook.
Let’s start coding: importing packages
Loading Packages: the import statement
Before you can use functions from an installed package in your notebook, you must import it into your current session
It’s a strong convention to use common aliases (nicknames) for widely used packages using the as keyword. This improves code readability and conciseness
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the main plotting module from Matplotlib
import numpy as np
These nicknames are widely recognised in the data science community, so using them makes your code more understandable to others
You can also import specific functions from a package using the from keyword
from pandas import DataFrame
from matplotlib.pyplot import plot
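This is useful if you only need a few names from a package and want to call them without the package prefix. Note that Python still loads the whole package; the from form only controls which names are available directly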
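The two import styles can be compared side by side. The sketch below uses the standard-library modules math and fractions as stand-ins for pandas and NumPy (an assumption made only so the example runs without extra installs); the alias m is likewise illustrative, not a community convention.

```python
import sys
import math as m                # alias import: call functions as m.<name>
from fractions import Fraction  # direct import: use Fraction without a prefix

# Alias style: the prefix shows where sqrt comes from.
root = m.sqrt(16)

# "from ... import ..." style: the name is available directly.
half = Fraction(1, 2)

# Python still loads the whole 'fractions' module behind the scenes;
# the "from" form only controls which names you can type directly.
module_loaded = "fractions" in sys.modules

print(root, half, module_loaded)  # 4.0 1/2 True
```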
Loading packages: in practice
Let’s import the “holy trinity” of data science packages in Python:
# "matplotlib.pyplot" is the primary interface for plotting, aliased as 'plt'
import matplotlib.pyplot as plt
# "pandas" is used for data manipulation and analysis, aliased as 'pd'
import pandas as pd
# "numpy" is for numerical operations, especially with arrays, aliased as 'np'
import numpy as np
print("Packages (matplotlib.pyplot, pandas, numpy) loaded successfully!")
The pd.read_csv() function is used to load data from a Comma Separated Values (CSV) file into a pandas DataFrame
Windows users, if you encounter issues with your file path, try using double backslashes \\ or a raw string r'path' to avoid escape character issues (e.g., 'C:\\path\\to\\file.csv' or r'C:\path\to\file.csv')
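The path advice can be checked directly in Python. The sketch below shows that the two Windows spellings produce the same string, and, as an optional alternative not covered in the lecture, uses the standard-library pathlib module to build the relative path data/features.csv.

```python
from pathlib import Path

# Doubled backslashes and a raw string spell the same Windows path.
escaped = 'C:\\path\\to\\file.csv'
raw = r'C:\path\to\file.csv'
print(escaped == raw)  # True

# pathlib joins folder and file names with the separator your system expects.
csv_path = Path("data") / "features.csv"
print(csv_path.as_posix())  # data/features.csv

# Check that the file is where you think it is before calling pd.read_csv().
print(csv_path.exists())
```

The last line prints True only if the file exists relative to the folder the notebook is running from.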
# The pd.read_csv() function reads a CSV file.
# The result (a DataFrame) is stored in the variable 'carfeatures'.
# Ensure 'data/features.csv' is in the correct path relative to your notebook.
carfeatures = pd.read_csv('data/features.csv')
carfeatures.head()  # Display the first few rows of the DataFrame
    mpg  cylinders  displacement  horsepower  weight  acceleration  vehicle id
0  18.0          8           307         130    3504          12.0   C-1689780
1  15.0          8           350         165    3693          11.5   B-1689791
2  18.0          8           318         150    3436          11.0   P-1689802
3  16.0          8           304         150    3433          12.0   A-1689813
4  17.0          8           302         140    3449          10.5   F-1689824
Viewing Your DataFrame in VS Code
VS Code offers several ways to inspect your DataFrame:
Jupyter Variables Tab:
Look for a “Variables” icon or section in the Jupyter interface within VS Code
Clicking it shows active variables; double-clicking a DataFrame (like carfeatures) opens it in a new tab for viewing
A data-viewing extension such as Data Wrangler (if you have it installed) provides more advanced tools for data viewing and cleaning
You might find a “View Data” button or a right-click option on the DataFrame variable
Running basic analyses
Displaying the carfeatures dataframe
Typing the name of a DataFrame in a code cell and running it will display its contents
For large DataFrames, Jupyter typically shows a summary (the first and last few rows).
carfeatures
      mpg  cylinders  displacement  horsepower  weight  acceleration  vehicle id
  0  18.0          8           307         130    3504          12.0   C-1689780
  1  15.0          8           350         165    3693          11.5   B-1689791
  2  18.0          8           318         150    3436          11.0   P-1689802
  3  16.0          8           304         150    3433          12.0   A-1689813
  4  17.0          8           302         140    3449          10.5   F-1689824
...   ...        ...           ...         ...     ...           ...         ...
393  27.0          4           140          86    2790          15.6   F-1694103
394  44.0          4            97          52    2130          24.6   V-1694114
395  32.0          4           135          84    2295          11.6   D-1694125
396  28.0          4           120          79    2625          18.6   F-1694136
397  31.0          4           119          82    2720          19.4   C-1694147
398 rows × 7 columns
You can also use the .head() method to display the first few rows of a DataFrame
# Display the first 5 rows of the DataFrame
carfeatures.head()  # Default is 5 rows
    mpg  cylinders  displacement  horsepower  weight  acceleration  vehicle id
0  18.0          8           307         130    3504          12.0   C-1689780
1  15.0          8           350         165    3693          11.5   B-1689791
2  18.0          8           318         150    3436          11.0   P-1689802
3  16.0          8           304         150    3433          12.0   A-1689813
4  17.0          8           302         140    3449          10.5   F-1689824
To view the last few rows, use .tail()
# Display the last 5 rows of the DataFrame
carfeatures.tail()  # Default is 5 rows
      mpg  cylinders  displacement  horsepower  weight  acceleration  vehicle id
393  27.0          4           140          86    2790          15.6   F-1694103
394  44.0          4            97          52    2130          24.6   V-1694114
395  32.0          4           135          84    2295          11.6   D-1694125
396  28.0          4           120          79    2625          18.6   F-1694136
397  31.0          4           119          82    2720          19.4   C-1694147
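Both .head() and .tail() accept a number of rows when the default of 5 is not what you want. The sketch below uses a small, hypothetical stand-in DataFrame named sample (a few mpg and cylinders values copied from the rows shown earlier) so that it runs without the course data file.

```python
import pandas as pd

# Hypothetical six-row stand-in for carfeatures.
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 27.0],
    "cylinders": [8, 8, 8, 8, 8, 4],
})

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_three = sample.tail(3)  # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```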
Selecting and displaying a single column
To select a single column from a DataFrame, use the column name in square brackets []
This returns a pandas Series (a one-dimensional array-like object)
# Extracting the 'cylinders' column.
carfeatures['cylinders'].head()
0 8
1 8
2 8
3 8
4 8
Name: cylinders, dtype: int64
You can also use the dot notation (if the column name doesn’t contain spaces or special characters)
# Using dot notation
carfeatures.cylinders.head()
0 8
1 8
2 8
3 8
4 8
Name: cylinders, dtype: int64
Note: Dot notation is less flexible than bracket notation, especially for column names with spaces or special characters
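The difference shows up with a column such as vehicle id, whose name contains a space. The sketch below uses a hypothetical two-row stand-in DataFrame named sample rather than the full carfeatures data.

```python
import pandas as pd

# Hypothetical two-row stand-in for carfeatures, including the 'vehicle id' column.
sample = pd.DataFrame({
    "cylinders": [8, 8],
    "vehicle id": ["C-1689780", "B-1689791"],
})

# Bracket notation works for any column name, including names with spaces.
ids = sample["vehicle id"]

# Dot notation only works when the name is a valid Python identifier.
cyl = sample.cylinders

# sample.vehicle id  <- would be a SyntaxError because of the space

print(ids.tolist())
print(cyl.tolist())
```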
Example: Computing a frequency table
To count occurrences of unique values in a column, use pd.crosstab()
This function creates a cross-tabulation of two (or more) variables
The index argument is the column whose values are counted; the string passed to columns is simply the label given to the single column of counts in the resulting table
# crosstab counts how many rows fall into categories
# "index" is the category
# "columns" is a custom title
table = pd.crosstab(index=carfeatures['cylinders'], columns="count")
table
col_0      count
cylinders
3              4
4            204
5              3
6             84
8            103
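For a single column, the Series method .value_counts() is a common alternative that gives the same tallies. The sketch below is not part of the lecture's workflow; it uses a short, hypothetical stand-in Series instead of the real cylinders column and compares the two approaches.

```python
import pandas as pd

# Hypothetical stand-in for the 'cylinders' column.
cylinders = pd.Series([8, 8, 4, 6, 4, 4, 8], name="cylinders")

# value_counts() tallies each unique value; sort_index() orders by category.
counts = cylinders.value_counts().sort_index()
print(counts)

# The crosstab approach gives the same tallies in a one-column table.
table = pd.crosstab(index=cylinders, columns="count")
print(table["count"].tolist())  # [3, 1, 3]
```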
The column name in the crosstab table is set to col_0 by default
table.columns.name
'col_0'
We can rename the column too
table.columns.name = 'column name'
table
column name  count
cylinders
3                4
4              204
5                3
6               84
8              103
Example: Generating basic summary statistics
The .describe() method on a DataFrame provides key descriptive statistics for all numerical columns
# .describe() computes: count, mean, standard deviation, min, 25th/50th/75th percentiles, and max
# It automatically ignores non-numeric columns
carfeatures.describe()
              mpg   cylinders  displacement       weight  acceleration
count  398.000000  398.000000    398.000000   398.000000    398.000000
mean    23.514573    5.454774    193.427136  2970.424623     15.568090
std      7.815984    1.701004    104.268683   846.841774      2.757689
min      9.000000    3.000000     68.000000  1613.000000      8.000000
25%     17.500000    4.000000    104.250000  2223.750000     13.825000
50%     23.000000    4.000000    148.500000  2803.500000     15.500000
75%     29.000000    8.000000    262.000000  3608.000000     17.175000
max     46.600000    8.000000    455.000000  5140.000000     24.800000
Note
The horsepower column might be missing from .describe() if it was read as non-numeric (e.g., due to placeholder characters like ‘?’). Data cleaning would be needed to convert it.
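One possible cleaning step is sketched below, assuming the placeholder really is a literal '?' (the lecture does not confirm what the file contains). It uses pd.to_numeric() with errors="coerce" on a hypothetical four-row stand-in DataFrame named sample.

```python
import pandas as pd

# Hypothetical stand-in: '?' marks a missing horsepower value.
sample = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})
print(sample["horsepower"].dtype)  # a text dtype, not a numeric one

# errors="coerce" turns anything that is not a number into NaN.
sample["horsepower"] = pd.to_numeric(sample["horsepower"], errors="coerce")

print(sample["horsepower"].dtype)   # now numeric
print(sample["horsepower"].mean())  # NaN values are skipped by default
```

After a conversion like this, a column of this kind is treated as numeric and is included in the .describe() output.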
Python variables & data types 📊
Variables: Named containers for data
Variables are used to store data values.
Python is dynamically typed, meaning you don’t need to declare the type of a variable.
Common Data Types:
Integers (int): Whole numbers (e.g., x = 10).
Floats (float): Numbers with decimals (e.g., pi = 3.14).
Strings (str): Text, enclosed in single ' ' or double " " quotes (e.g., message = "Hello").
Booleans (bool): Logical values, either True or False.
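The four types can be inspected with the built-in type() function. In the sketch below, x, pi and message come from the examples above, while is_ready is a hypothetical name added for the Boolean case.

```python
# Python infers each variable's type from the value assigned to it.
x = 10             # int
pi = 3.14          # float
message = "Hello"  # str
is_ready = True    # bool (hypothetical variable name)

for value in (x, pi, message, is_ready):
    print(value, type(value).__name__)

# Dynamic typing: the same name can later hold a value of a different type.
x = "ten"
print(type(x).__name__)  # str
```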
Exercise: Define variables for your name and major, then print a concatenated string introducing yourself
Python lists 📝
Lists: Ordered, mutable collections
Lists are versatile and widely used to store multiple items in a single variable.
Characteristics:
Created using square brackets [], items separated by commas.
Items can be of different data types (e.g., numbers, strings, even other lists).
Ordered: Items maintain their position/sequence.
Mutable: You can change, add, or remove items after the list is created.
Access items using their index (position).
Very important: Python indexing starts at 0! The first item is at index 0.
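Accessing and modifying items can be sketched with a short list. The list name scores and its values are hypothetical, chosen only for illustration.

```python
# Hypothetical list used only for illustration.
scores = [70, 85, 90]

first = scores[0]    # indexing starts at 0
last = scores[-1]    # negative indices count from the end

scores[1] = 88       # change an item in place
scores.append(95)    # add an item at the end
scores.remove(70)    # remove the first occurrence of a value

print(first, last)   # 70 90
print(scores)        # [88, 90, 95]
print(len(scores))   # 3
```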
# List of integers
prime_numbers = [2, 3, 5, 7, 11]
# List of strings (e.g., survey responses)
fav_colours = ["blue", "green", "blue", "red", "yellow"]
print(f"Prime Numbers: {prime_numbers}, Type: {type(prime_numbers)}")
print(f"Favourite Colours: {fav_colours}")
# List with mixed data types
mixed_data_list = ["Python", 3.11, 42, True]
# Lists can be nested (a list containing other lists)
nested_list_example = [[1, 2], ["a", "b"], mixed_data_list]
nested_list_example
The inner list ["Python", 3.11, 42, True] is at nested_list_example[2]
Python is the first element (index 0) of that inner list
nested_list_example[2][0]
'Python'
Try it yourself!
Create a list containing the titles of your three favourite films
Print the last film using both positive and negative indexing
Visualising data with matplotlib 📊
Visualising list data: histograms
matplotlib.pyplot (imported as plt) is the main interface to Matplotlib, Python’s most widely used plotting library
plt.hist() creates a histogram. This is useful for visualising the distribution of a single set of numerical data or the frequency of categorical items in a list
plt.show() displays the generated plot
Always add labels and a title for clarity:
plt.title("My Plot Title")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
# import matplotlib.pyplot as plt
# fav_colours = ["blue", "green", "blue", "red", "yellow"]
colour_survey = fav_colours + ['blue', 'green', 'green']
plt.hist(x=colour_survey, color='skyblue')
plt.title("Frequency of Favourite Colours in Survey")
plt.xlabel("Colour")
plt.ylabel("Frequency Count")
plt.show()
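To check the bar heights a histogram of categorical items should show, the same frequencies can be tallied without plotting. The sketch below uses collections.Counter from the standard library, which is an addition to the lecture's toolkit, and repeats the list definitions so that it runs on its own.

```python
from collections import Counter

fav_colours = ["blue", "green", "blue", "red", "yellow"]
colour_survey = fav_colours + ['blue', 'green', 'green']

# Count how many times each colour appears in the list.
colour_counts = Counter(colour_survey)

for colour, count in colour_counts.items():
    print(colour, count)
# blue 3
# green 3
# red 1
# yellow 1
```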
Try it yourself!
Create a list with repeated string values (e.g., favourite books) and generate a histogram
Ensure you add a title, and labels for the x and y axes!
Visualising relationships: scatter plots with lists
plt.scatter() is used to create a scatter plot.
It requires two lists of equal length: one for the x-coordinates and one for the y-coordinates.
Scatter plots are excellent for visualising the relationship (or lack thereof) between two numerical variables.
x_values_list = [1, 2, 3, 4, 5]
y_values_list = [1, 4, 9, 16, 25]
plt.scatter(x=x_values_list, y=y_values_list, color='purple')
plt.xlabel("X Value (Input Number)")
plt.ylabel("Y Value (Squared Number)")
plt.title("Scatter Plot: Numbers vs. Their Squares (from Lists)")
plt.grid(True)
plt.show()
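The equal-length requirement is easy to satisfy when the y-values are computed from the x-values. The sketch below builds the squares with a list comprehension (a construct not yet introduced in this lecture) and checks the lengths before any plotting would take place.

```python
x_values_list = [1, 2, 3, 4, 5]

# Build the y-values from the x-values so the two lists always match.
y_values_list = [x ** 2 for x in x_values_list]

# plt.scatter() raises an error if the lengths differ, so check first.
same_length = len(x_values_list) == len(y_values_list)

print(y_values_list)  # [1, 4, 9, 16, 25]
print(same_length)    # True
```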
Scatter plots directly from dataframe columns
You can create scatter plots using columns from your pandas DataFrames
Pass the columns themselves (e.g., carfeatures['weight']) as the x and y arguments
Ensure the DataFrame is not empty and contains the specified columns!
# Using the 'carfeatures' DataFrame
plt.scatter(x=carfeatures['weight'], y=carfeatures['mpg'], alpha=0.7, color='green')
plt.xlabel("Weight of the Car (lbs)")
plt.ylabel("Miles Per Gallon (MPG)")
plt.title("Scatter Plot: Car Weight vs. MPG")
plt.grid(True)
plt.show()
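A quick pre-plot check for an empty DataFrame or missing columns might look like the sketch below. It uses a hypothetical two-row stand-in named sample; with the real data, carfeatures would take its place.

```python
import pandas as pd

# Hypothetical stand-in for carfeatures with only the columns needed here.
sample = pd.DataFrame({"weight": [3504, 3693], "mpg": [18.0, 15.0]})

required = {"weight", "mpg"}

# .empty is True when the DataFrame has no rows or no columns.
has_rows = not sample.empty

# Compare the required names against the DataFrame's column labels.
missing = required - set(sample.columns)

if has_rows and not missing:
    print("Ready to plot")
else:
    print("Cannot plot; missing columns:", sorted(missing))
```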
Suggestion: Try another scatter plot using different columns, e.g., x=carfeatures['acceleration'] vs y=carfeatures['mpg'].
Try it yourself!
Create two lists of numbers representing, for example, hours studied and exam scores.
Generate your own scatter plot from these lists.
Ensure you add a title, and labels for the x and y axes.
Appendix - Solutions
Define variables for your name and major, then print an introduction
user_name = "Charlie"  # Variable for name
user_major = "Quantitative Sciences"  # Variable for major
# Concatenating strings to form an introduction
introduction_message = "My name is " + user_name + " and I am majoring in " + user_major + "."
print(introduction_message)
# Alternative using an f-string (formatted string literal - often preferred for readability)
print(f"My name is {user_name} and I am majoring in {user_major}.")
My name is Charlie and I am majoring in Quantitative Sciences.
My name is Charlie and I am majoring in Quantitative Sciences.
Create a list of your three favourite films and print the last one
favourite_films_list = ["Pulp Fiction", "Blade Runner 2049", "The Godfather"]
# Print the last film using positive indexing
# Length of the list is 3, so the index of the last element is 2 (3 - 1)
print(favourite_films_list[2])
# Print the last film using negative indexing (-1 always refers to the last element)
print(favourite_films_list[-1])
The Godfather
The Godfather
Create a list of your favourite books and generate a histogram
# A list of favourite books, with some titles repeated
favourite_books_survey = ["The Iliad", "1984", "Brave New World", "The Iliad",
                          "The Odyssey", "1984", "The Odyssey", "The Iliad"]
plt.hist(x=favourite_books_survey, color='teal', edgecolor='black')  # edgecolor outlines each bar
plt.title("Frequency of Favourite Books Mentioned")
plt.xlabel("Book Title")
plt.ylabel("Number of Mentions")
plt.show()
Create two lists of numbers (e.g., hours studied & exam scores) and make a scatter plot
# Example data: hours studied vs. exam scores
hours_studied_data = [1, 2, 2.5, 3, 4, 4.5, 5, 6, 7, 8]
exam_scores_data = [60, 65, 68, 70, 78, 80, 85, 88, 92, 95]
plt.scatter(x=hours_studied_data, y=exam_scores_data, color='crimson', marker='^')  # '^' for triangle markers
plt.xlabel("Hours Studied This Week")
plt.ylabel("Exam Score (out of 100)")
plt.title("Relationship: Study Hours vs. Exam Scores")
plt.xlim(0, 9)  # Set x-axis limits
plt.ylim(50, 100)  # Set y-axis limits
plt.grid(True, linestyle='--', alpha=0.7)  # Add a styled grid
plt.show()