Lecture 15 - Cloud Computing II
| Layer | On-premise | IaaS | PaaS | SaaS |
|---|---|---|---|---|
| Application | User | User | User | Provider |
| Middleware | User | User | Provider | Provider |
| OS | User | User | Provider | Provider |
| Virtualisation | User | Provider | Provider | Provider |
| Servers | User | Provider | Provider | Provider |
| Networking | User | Provider | Provider | Provider |

Each cell indicates whether the user or the cloud provider manages that layer of the stack.
Linux commands we will need:

- Basic navigation: `ls`, `cd`, `pwd`, etc.
- `sudo` to run commands as root (superuser)
- Package management: `apt update`, `apt upgrade`, `apt install`
- File transfer and permissions: `scp` and `chmod`

Source: DataCamp
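If you prefer to experiment from Python, the navigation commands above have standard-library counterparts. A quick sketch (nothing here is specific to this course):

```python
import os

# pwd: show the current working directory
cwd = os.getcwd()
print(cwd)

# ls: list the entries in the current directory
entries = sorted(os.listdir('.'))
print(entries)

# cd: os.chdir() changes the working directory (go up one level, then back)
os.chdir('..')
print(os.getcwd())
os.chdir(cwd)
```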
Using `apt` and preparing your key file:

- `apt` (Advanced Package Tool) is Ubuntu's package manager
- `apt update` refreshes the package list
- `apt install` installs a package
- `sudo apt install python3` installs Python 3
- `sudo apt install python3-pip` installs pip
- `chmod 400` sets the correct permissions on your key file (read-only for you, no access for anyone else)
- To go to your `/home/user` directory, type `cd ~` and then `pwd` in your WSL terminal
- You can move your key file there with the `mv` command in WSL

Launching an EC2 instance:

- Under Name and tags, name your instance. For example, name it `datasci350`
- Choose Ubuntu Server 24.04 LTS as the OS image and 64-bit (x86) as the architecture
- Choose `t2.micro` or `t1.micro` as the instance type, both of which are free tier eligible
- Click Create a new key pair and give it a name (e.g. `datasci350`)
- Choose ED25519 and `.pem` as the file format
- Click Create Key Pair and save it to a secure location
- Under Network settings, you may check Allow HTTPS traffic from the internet and Allow HTTP traffic from the internet so that you can access the web servers hosted on your EC2 VM
- Review the Root volume under Configure storage:
  - `gp3` is the default volume type, and it works fine for most use cases
  - `io2` is the fastest and most expensive volume type, usually used for high-performance databases (which require millisecond latency)
  - Choose `sc1` or `st1` for cost savings
- After you click Launch, you will be taken to the Instances page

Connecting to your instance:

- Click Connect to instance to see the instructions
- Select SSH client to see the command to connect to your instance
- Run `chmod 400 "name-of-your-key.pem"` and the example command provided by AWS (`ssh -i "name-of-your-key.pem" ubuntu@public-ip`)
- Click the Instances link on the left to see the details of your instance
- Once connected, run `sudo apt update` to refresh the package list and `sudo apt upgrade` to install the latest updates
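To see what `chmod 400` actually does to your key file, here is a small Python sketch. It creates a throwaway file (the name `demo-key.pem` is just a placeholder, not a real key) and restricts it to owner-read-only, the permission SSH requires before it will use a private key:

```python
import os
import stat

# Create a throwaway file standing in for a private key
with open('demo-key.pem', 'w') as f:
    f.write('not a real key\n')

# chmod 400: read permission for the owner only, no access for group/others
os.chmod('demo-key.pem', 0o400)

# Verify: S_IMODE extracts just the permission bits from the file's mode
mode = stat.S_IMODE(os.stat('demo-key.pem').st_mode)
print(oct(mode))  # 0o400
```

SSH refuses keys with looser permissions ("UNPROTECTED PRIVATE KEY FILE" error), which is why AWS tells you to run `chmod 400` before connecting.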
- Use the `-y` flag to automatically answer yes to all prompts, e.g., `sudo apt update && sudo apt upgrade -y`
- Install packages with `sudo apt install`, e.g., `sudo apt install python3` and `sudo apt install python3-pip`

Copying files with `scp` (secure copy):

- `scp` is a command-line tool that allows you to copy files securely
- The `-i` flag ("identity file") specifies which private key file to use for authentication (the `.pem` file you downloaded when creating the key pair)
- Upload (local to instance): `scp -i "name-of-your-key.pem" file-to-copy ubuntu@public-ip:/path/to/destination`
- Download (instance to local): `scp -i "name-of-your-key.pem" ubuntu@public-ip:/path/to/file-to-copy /path/to/destination`
- You can also use `scp` to copy files between two remote servers
- Replace XXXXXX with your public IP, and note that `:~` is your home directory on the instance. Don't forget to add it!
- You can then list (`ls`) and run (`python3`) the file on your instance

Running Jupyter on your instance:

- Install `jupyter` (as we did in the previous slides)
- Connect with port forwarding: `ssh -i "<your-key>.pem" ubuntu@<public_IPv4_DNS_address> -L 8000:localhost:8888`
- The `-L 8000:localhost:8888` part forwards port 8888 on the instance to port 8000 on your machine
- On the instance, run `sudo apt update && sudo apt upgrade -y`, then `sudo apt install -y python3 python3-pip python3-notebook`
- Run `source ~/.profile` and check the installation with `which python3`, `which pip3`, and `which jupyter`
- Start the server with `jupyter notebook`
- On your local machine, open `http://localhost:8000` and copy the token from the terminal to log in
- Alternatively, copy the full URL from the terminal (`http://localhost:8888/?token=...`) and change 8888 to 8000
- Try it out: `print('Hello, DATASCI350!')` (or any other code you like!)

Two ways to get files onto your instance:

- Local to cloud (`scp`): create a file on your local machine, then upload it to the instance. Use this when you have files on your own computer that are not available online (e.g., your own datasets, private code)
- Internet to cloud (`wget`): download a file directly from the internet to the instance. Use this when the file is already hosted somewhere (e.g., GitHub, a public dataset URL). This skips your local machine entirely

First, install the analysis libraries on the instance: `sudo apt install -y python3-numpy python3-pandas python3-matplotlib python3-seaborn`

Method 1: local to cloud with `scp`. On your local machine, create a weather dataset with the Python code below, or download it here: weather_data.py

```python
# weather_data.py
import pandas as pd
import numpy as np
import datetime

# Set seed for reproducibility
np.random.seed(42)

# Generate dates for the past 30 days
dates = pd.date_range(end=datetime.datetime.now(), periods=30).tolist()
dates = [d.strftime('%Y-%m-%d') for d in dates]

# Generate temperature data with some randomness
temp_high = np.random.normal(75, 8, 30)
temp_low = temp_high - np.random.uniform(10, 20, 30)
precipitation = np.random.exponential(0.5, 30)
humidity = np.random.normal(65, 10, 30)

# Create a structured dataset
weather_data = pd.DataFrame({
    'date': dates,
    'temp_high': temp_high,
    'temp_low': temp_low,
    'precipitation': precipitation,
    'humidity': humidity
})

# Save to a text file with a one-line comment header
with open('weather_data.txt', 'w') as f:
    f.write("# Weather data for the past 30 days\n")
    f.write(weather_data.to_string(index=False))

print("Weather data saved to weather_data.txt")
```

- Run this script on your local machine: `python3 weather_data.py`
- It will create a file called `weather_data.txt` with 30 days of weather data
- Now upload it to your EC2 instance using `scp` (from a local terminal): `scp -i <your-key>.pem weather_data.txt ubuntu@<your-instance-ip>:~/`
- You can verify it arrived by running `ls` on your instance
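Before (or after) uploading, you can sanity-check that the whitespace-separated format written by `weather_data.py` parses the way the analysis script will read it. A minimal round trip on a toy table (the values below are made up for illustration):

```python
from io import StringIO
import pandas as pd

# A tiny table in the same format as weather_data.txt:
# a comment header line, then whitespace-separated columns
text = """# Weather data (toy example)
      date  temp_high  temp_low
2025-01-01       71.2      58.3
2025-01-02       69.8      55.1
"""

lines = text.splitlines(keepends=True)
# Skip the comment header, then split columns on runs of whitespace
df = pd.read_csv(StringIO(''.join(lines[1:])), sep=r'\s+')

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['date', 'temp_high', 'temp_low']
```

This is exactly the read path used later in `weather_analysis.py`, so if it works on the toy table it will work on the real file.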
Method 2: internet to cloud with `wget`. Now let's get the analysis script directly onto the instance, without going through your local machine. Run this on your EC2 instance:
`wget https://raw.githubusercontent.com/danilofreire/datasci350/main/lectures/lecture-15/weather_analysis.py` (one line)

Here is the script for reference (no need to copy it; `wget` already downloaded it):

```python
# weather_analysis.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO

# Read the weather data
with open('weather_data.txt', 'r') as f:
    lines = f.readlines()

# Skip the header comment, then parse the whitespace-separated table
data_str = ''.join(lines[1:])
df = pd.read_csv(StringIO(data_str), sep=r'\s+')

# Print basic statistics
print("Weather Data Analysis:")
print("=====================")
print(f"Number of days: {len(df)}")
print(f"Average high temperature: {df['temp_high'].mean():.1f}°F")
print(f"Average low temperature: {df['temp_low'].mean():.1f}°F")
print(f"Maximum temperature: {df['temp_high'].max():.1f}°F on {df.loc[df['temp_high'].idxmax(), 'date']}")
print(f"Minimum temperature: {df['temp_low'].min():.1f}°F on {df.loc[df['temp_low'].idxmin(), 'date']}")
print(f"Days with precipitation > 1 inch: {len(df[df['precipitation'] > 1])}")

# Create a visualisation
plt.figure(figsize=(12, 6))
sns.set_style("whitegrid")

# Plot temperature range on the primary (left) axis
ax1 = plt.gca()
ax1.fill_between(df['date'], df['temp_low'], df['temp_high'], alpha=0.3, color='skyblue')
ax1.plot(df['date'], df['temp_high'], marker='o', color='red', label='High Temp')
ax1.plot(df['date'], df['temp_low'], marker='o', color='blue', label='Low Temp')

# Add precipitation as bars on a secondary (right) axis
ax2 = ax1.twinx()
ax2.bar(df['date'], df['precipitation'], alpha=0.3, color='navy', width=0.5, label='Precipitation')
ax2.set_ylabel('Precipitation (inches)', color='navy')
ax2.tick_params(axis='y', labelcolor='navy')

# Formatting (use ax1 explicitly so the labels land on the temperature axis)
ax1.set_title('30-Day Weather Report: Temperature Range and Precipitation', fontsize=16)
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.set_ylabel('Temperature (°F)')
ax1.legend(loc='upper left')
plt.tight_layout()

# Save the figure
plt.savefig('weather_analysis.png')
print("Analysis complete. Results saved to 'weather_analysis.png'")
```

- You should now have both files on the instance: `weather_data.txt` (uploaded via `scp`) and `weather_analysis.py` (downloaded via `wget`)
- Run the analysis: `python3 weather_analysis.py`
- Alternatively, open the script in Jupyter (`jupyter notebook`) and run it there
- Download the resulting plot to your machine (`scp` again, from a local terminal):
`scp -i <your-key>.pem ubuntu@<your-instance-ip>:~/weather_analysis.png ./`

Recap: we used `scp` to move files between your computer and the instance, and `wget` to download files from the internet directly to the instance. Both are useful in different situations!
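As an aside, the `wget` step can also be scripted in Python with the standard library: `urllib.request.urlretrieve` is a rough analogue of `wget <url>`. The sketch below uses a local `file://` URL and placeholder filenames so it runs without network access:

```python
import pathlib
import urllib.request

# Stand-in for a remote file; using file:// keeps the example offline
src = pathlib.Path('remote_script.py').resolve()
src.write_text("print('hello from the downloaded script')\n")

# Roughly what `wget <url>` does: fetch the URL and save it to a local file
url = src.as_uri()  # e.g. file:///home/user/remote_script.py
urllib.request.urlretrieve(url, 'downloaded_script.py')

print(open('downloaded_script.py').read())
```

With a real `https://` URL (like the raw GitHub link above), the same two lines fetch the file from the internet.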
You’ve just completed a full data analysis workflow in the cloud 🎉
This workflow is similar to how data scientists use cloud resources for larger datasets and more complex analyses!
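One pattern from `weather_analysis.py` worth remembering: `idxmax()` returns the row label of a column's maximum, and `.loc` then reads any other column from that same row. A toy illustration with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-02', '2025-01-03'],
    'temp_high': [71.2, 82.5, 69.8],
})

# idxmax gives the index of the hottest row; .loc looks up its date
hottest_day = df.loc[df['temp_high'].idxmax(), 'date']
print(f"Maximum temperature: {df['temp_high'].max():.1f}°F on {hottest_day}")
# Maximum temperature: 82.5°F on 2025-01-02
```

The same one-liner scales unchanged from this 3-row toy frame to the 30-day dataset on your instance.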