Lecture 20 - Interactive Data Analysis with Jupyter on AWS EC2
Objective:
In this session, you will use a remote cloud server as a fully interactive data science workbench.
Process:
You will launch an EC2 instance, set up a remote Jupyter Notebook environment, download and clean a dataset interactively, perform an analysis with visualisation, and then retrieve your final work.
Environment:
This workflow uses a single AWS EC2 instance managed from your local terminal.
Estimated Time:
45-60 minutes.
Launch an EC2 Instance:
Log into the AWS Console and launch one `t2.micro` instance. Use the Ubuntu Server 24.04 LTS image and name it `jupyter-ec2-instance`. Use your existing `qtm350_key.pem` key pair and configure the security group to allow SSH traffic.
Connect and Install Software:
Connect to your instance using `ssh`. Once connected, install `wget`, `pandas`, `matplotlib`, `seaborn`, and Jupyter on the instance, as sketched below.
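One way to do this is sketched here; Ubuntu 24.04 restricts system-wide `pip` installs, so a virtual environment is used (the `jupyter-env` path and the `<ec2-public-ip>` placeholder are assumptions, not values from the exercise):

```bash
# From your local terminal: connect with your key pair
ssh -i qtm350_key.pem ubuntu@<ec2-public-ip>

# On the instance: system packages first, then Python libraries in a venv
sudo apt update && sudo apt install -y wget python3-pip python3-venv
python3 -m venv ~/jupyter-env
source ~/jupyter-env/bin/activate
pip install pandas matplotlib seaborn notebook
```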
Start a Port-Forwarding SSH Session:
Open a new local terminal window and run the following command. This maps port 8000 on your local machine to port 8888 on the EC2 instance.
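A sketch of the command, reusing the key pair from the launch step (the public IP is a placeholder):

```bash
# -L 8000:localhost:8888 forwards local port 8000 to port 8888 on the instance
ssh -i qtm350_key.pem -L 8000:localhost:8888 ubuntu@<ec2-public-ip>
```

Keep this terminal open; the tunnel lasts only as long as the SSH session.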
Start the Jupyter Notebook Server:
In your original SSH terminal connected to EC2, start the Jupyter server. The `--no-browser` flag is important: the instance has no graphical browser, so Jupyter should only print the access URL instead of trying to open one.
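A sketch of the command, assuming the virtual environment from the install step is active and Jupyter's default port 8888:

```bash
jupyter notebook --no-browser --port 8888
```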
Access Jupyter in Your Browser:
Jupyter will print a URL with a security token. Copy this token.
Open a web browser on your local machine and navigate to http://localhost:8000.
Paste the token into the password box to log in.
You are now running a Jupyter Notebook session on a remote cloud server. All work from this point will be done in notebook cells.
Create a New Notebook:
In the Jupyter file browser, click `New` -> `Python 3 (ipykernel)` to create a new notebook. Rename it `employee_analysis.ipynb`.
Download the Raw Data:
In the first cell of your new notebook, use the `wget` command to download `dirty_employee_data.csv` to your EC2 instance.
The link is https://github.com/danilofreire/qtm350-summer/blob/main/lectures/lecture-20/dirty_employee_data.csv.
If you need to create the dataset, you can find the script here: https://github.com/danilofreire/qtm350-summer/blob/main/lecture-20/dirty-data-generation.py.
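A minimal sketch of the cell; note that `wget` needs the raw file rather than the GitHub web page, so `?raw=true` is appended (an assumption about how the repository serves the file):

```python
# The leading ! runs a shell command from a notebook cell.
# GitHub redirects ?raw=true requests to the raw CSV; -O sets the local file name.
!wget -O dirty_employee_data.csv "https://github.com/danilofreire/qtm350-summer/blob/main/lectures/lecture-20/dirty_employee_data.csv?raw=true"
```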
Load and Inspect the Data:
In the next cell, load the data using pandas and use `df.info()` to inspect its structure and identify problems.
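A sketch of the loading cell, using the file name from the download step:

```python
import pandas as pd

df = pd.read_csv('dirty_employee_data.csv')
df.info()  # column dtypes and non-null counts reveal missing values and type problems
```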
Clean the Data:
In the following cells, clean the dataset step by step:
- Use the `.replace()` method to merge 'HR'/'hr' into 'Human Resources' and 'IT'/'Tech' into 'Technology'.
- Use `.groupby()` and `.transform()` to fill missing salary values with the median salary of that employee's department.
- Drop rows where `start_date` is missing using `.dropna()`.
- Convert the `start_date` column to a proper datetime format using `pd.to_datetime()`.

```python
# Standardise inconsistent department labels
department_map = {'HR': 'Human Resources', 'hr': 'Human Resources',
                  'IT': 'Technology', 'Tech': 'Technology'}
df['department'] = df['department'].replace(department_map)

# Fill missing salaries with the median salary of the employee's department
df['salary'] = df.groupby('department')['salary'].transform(lambda x: x.fillna(x.median()))

# Drop rows with no start date, then parse the column as datetime
df.dropna(subset=['start_date'], inplace=True)
df['start_date'] = pd.to_datetime(df['start_date'])

print("Data after cleaning:")
df.info()
```
Visualise the Data:
Use `sns.countplot()` to create a bar chart showing the number of employees in each department. Here is an example of the code:
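(A minimal sketch; it assumes `matplotlib` and `seaborn` have not yet been imported in the notebook, and the plot title and axis labels are illustrative.)

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of employee counts per department
sns.countplot(data=df, x='department')
plt.title('Employees per Department')
plt.xlabel('Department')
plt.ylabel('Number of Employees')
plt.tight_layout()
plt.show()
```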
Save the Cleaned Data:
In the final cell of your notebook, save the cleaned DataFrame to a new CSV file.
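A sketch, with a hypothetical output file name:

```python
# index=False keeps pandas' row index out of the file
df.to_csv('clean_employee_data.csv', index=False)
```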
Download Your Files:
In the Jupyter interface in your browser, first save the notebook (`File` -> `Save Notebook`). Then, from your local terminal, use `scp` to download both your completed notebook and the new clean CSV file.
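A sketch of the transfer; the notebook name comes from the earlier step, while the CSV name matches the hypothetical one used above, and both files are assumed to sit in the remote home directory:

```bash
# Run from your local machine, not the instance
scp -i qtm350_key.pem ubuntu@<ec2-public-ip>:~/employee_analysis.ipynb .
scp -i qtm350_key.pem ubuntu@<ec2-public-ip>:~/clean_employee_data.csv .
```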
Clean Up:
In the SSH terminal running Jupyter, press `Ctrl+C` twice to stop the server. Then go to the AWS Console and terminate the `jupyter-ec2-instance`.