QTM 350 - Data Science Computing

Lecture 15 - Cloud Computing II

Danilo Freire

Emory University

Hello, everyone! 👋

Brief recap 📚

Introduction to cloud computing and AWS

  • Cloud computing fundamentals:
    • On-demand resource delivery with pay-as-you-go pricing models
    • Elastic scalability, measured service, resource pooling
    • Service models: IaaS (virtualisation), PaaS (managed platforms), SaaS (full applications)
  • Enterprise case studies:
    • Animoto: Scaled from 50 to 3,500 EC2 instances during viral Facebook app growth
    • Washington Post: Processed 17,481 pages of Clinton schedules in 9 hours using 200 EC2 instances
    • Netflix: Migrated entirely to AWS for global video streaming infrastructure
[Diagram: the service-model stack (Application, Middleware, OS, Virtualisation, Servers, Networking). On-premise, the user manages every layer; with IaaS, the provider manages Virtualisation, Servers, and Networking; with PaaS, the provider manages everything except the Application; with SaaS, the provider manages the entire stack.]

Introduction to cloud computing and AWS

  • AWS architecture components:
    • EC2: Resizable compute capacity with auto-scaling
    • S3: Object storage with 11 nines (99.999999999%) durability and lifecycle policies
    • RDS: Managed relational databases (PostgreSQL/MySQL, etc)
    • SageMaker: ML workflows with Jupyter integration
  • Cost management:
    • Always configure billing alarms and budget limits!
    • Monitor data transfer costs and storage class transitions

Today’s plan 📅

Today’s plan

  • Using EC2 instances from your local machine
    • Launching instances via AWS Console
    • Connecting via SSH with key pairs
  • Essential Linux commands for instance management:
    • Package management: apt update, apt upgrade, apt install
    • File operations: scp and chmod
  • Activity 01: Launching a Jupyter notebook on an EC2 instance
    • Installing Jupyter and running a Python script
    • Forwarding ports with SSH
    • Running Jupyter on a remote server
  • Activity 02: Data analysis on the cloud (time permitting)
    • Creating a weather dataset and uploading it to an EC2 instance
    • Analysing the data with Python and downloading the results

EC2 from your local machine 🖥️

EC2 instances in a nutshell

  • Just to recap: EC2 is a virtual server in the cloud
  • Each instance has a full OS and can run any software
  • Virtualisation allows multiple instances on the same hardware
  • Instance types vary in CPU, memory, storage, and network capacity
    • T2 instances are general-purpose, M5 are memory-optimised, etc
    • You can change instance types on-the-fly
  • The free tier offers 750 hours of t2.micro usage per month, with 1 GB of RAM and 30 GB of storage (you can list free-tier-eligible types with the CLI sketch below)
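
If you have the AWS CLI installed and configured, you can check which instance types are free tier eligible directly from the terminal. A minimal sketch (output formatting may vary):

aws ec2 describe-instance-types \
    --filters "Name=free-tier-eligible,Values=true" \
    --query "InstanceTypes[].InstanceType" --output text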

Source: DataCamp

Ubuntu Linux

  • Linux is a popular open-source operating system
  • Linux distributions are popular precisely because they are free, but they are also considered more secure than Windows and far more customisable than macOS
  • The most widely used Linux distribution is Ubuntu
  • Google, Amazon, and Microsoft all offer Ubuntu instances
  • Windows Subsystem for Linux (WSL) allows you to run Ubuntu on Windows
  • For our purposes, we will use Ubuntu instances on AWS
  • All Linux distributions share a common set of core commands, and most, including Ubuntu, use bash as the default shell
  • Therefore, all commands we have learnt so far will work on Ubuntu

Ubuntu server

  • While Ubuntu is a complete OS (with a GUI), we will use the terminal to interact with our instances
  • Why should we only use the terminal?
    • A bare-bones Ubuntu instance is cheaper and faster than a full desktop
    • And as you already know, the terminal has everything you need to manage your instance and run your code
  • Again, all commands are the same (ls, cd, pwd, etc)
  • However, you have to use sudo to run commands as root (superuser)
    • This is because root has full control over the system and can do anything
    • Be careful with root!
  • To install software, use apt (Advanced Package Tool)
  • apt update refreshes the package list
  • apt install installs a package
  • sudo apt install python3 installs Python 3
  • sudo apt install python3-pip installs pip (a minimal session is sketched below)
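
Putting these together, a first session on a fresh instance might look like the sketch below (the -y flag answers prompts automatically):

sudo apt update                      # refresh the package list
sudo apt upgrade -y                  # install available updates
sudo apt install -y python3 python3-pip
python3 --version && pip3 --version  # verify the installation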

Windows Subsystem for Linux (WSL)

  • WSL is a compatibility layer for running Linux binary executables natively on Windows
  • It allows you to run a full-fledged Ubuntu terminal on Windows
  • I strongly recommend using WSL if you want to use a bash-like terminal
  • I hope many of you have already installed it! 😊
  • More about WSL here: https://learn.microsoft.com/en-us/windows/wsl/install
  • If you work with WSL, make sure to store your SSH key file under your Ubuntu home directory
  • The Windows filesystem (mounted under /mnt/) does not handle Unix permissions such as chmod 400 properly
  • To get to your home directory, just type cd with no arguments. For example:
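
A minimal sketch, assuming the key was downloaded to your Windows Downloads folder (the username and path are placeholders; adjust them to your machine):

cp /mnt/c/Users/<your-windows-user>/Downloads/qtm350.pem ~/   # move the key into the Linux filesystem
chmod 400 ~/qtm350.pem                                        # the permission change now sticks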

Ubuntu on Windows (WSL)

Step 01: Launching an EC2 instance

  • Now we are ready to launch an EC2 instance!
  • Go to the AWS Console: https://aws.amazon.com/console/, sign in, and click on (search for) EC2
  • You will see the EC2 Dashboard, with a big orange button saying “Launch Instance”
  • Click on it, and you will be taken to the AMI (Amazon Machine Image) selection page
  • There, we will choose an Ubuntu AMI (free tier eligible)
  • Amazon has thousands of AMIs available, including Windows, MacOS, and other Linux distributions
  • Many are already pre-configured to run specific software (e.g. Jupyter, TensorFlow, Docker, etc)

Step 02: Choosing an instance type

  • Type a name to label your VM under Name and tags. For example, name your instance qtm350
  • I recommend using Ubuntu Server 24.04 LTS as the OS image and 64-bit (x86) as the architecture
  • You can also specify the number of instances to launch at this time. To start off, choose 1
  • Next choose an EC2 instance type
  • To test, you can create one or more t2.micro instances, which are free tier eligible (t3.micro is the free-tier option in regions where t2 is unavailable)
  • If you need more power, you can always change the instance type later
  • So far, so good? 😊

Step 03: Choose an SSH key pair

  • Next, you will be asked to choose a key pair
  • A key pair is a pair of cryptographic keys that you use to authenticate to an instance
  • Click on Create a new key pair and give it a name (e.g. qtm350)
  • Choose ED25519 as the key type and .pem as the file format
  • Click on Create Key Pair and save it to a secure location
  • It is important to keep this key safe and know where it is, as it is the only way to access your instance
  • You can also use an existing key pair if you have one, or create one from the command line, as sketched below
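
If you prefer the command line, a key pair can be created with the AWS CLI. A sketch, assuming the CLI is configured (the key name qtm350 is just an example):

aws ec2 create-key-pair --key-name qtm350 --key-type ed25519 --key-format pem \
    --query "KeyMaterial" --output text > qtm350.pem
chmod 400 qtm350.pem   # restrict permissions so SSH will accept the key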

Step 04: Check HTTP/HTTPS options

  • Under Network settings, you may check Allow HTTPS traffic from the internet and Allow HTTP traffic from the internet so that you can access web servers hosted on your EC2 VM
  • We will allow all traffic for now, but in a production environment, you should restrict traffic to only what is necessary
  • Why so? Security! The more open ports you have, the more vulnerable your instance is to attacks
  • You can always edit the security group later to add or remove rules, as sketched below
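
For example, opening HTTPS (port 443) to the world after launch could look like this. A sketch, assuming the AWS CLI is configured; the security group ID is a placeholder:

aws ec2 authorize-security-group-ingress --group-id <your-security-group-id> \
    --protocol tcp --port 443 --cidr 0.0.0.0/0   # replace the placeholder with your group ID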

Step 05: Configure storage

  • You may also increase the storage capacity of the Root volume under Configure storage
  • By default you will be allocated a small 8-GB root disk
  • The free tier gives you 30 GB of storage, so you can increase the size to 30 GB
  • You can also add additional volumes if you need more storage
  • gp3 is the default volume type, and it works fine for most use cases
  • io2 is the fastest and most expensive volume type, usually used for high-performance databases (which require millisecond latency)
  • If you have data that is not frequently accessed, you can use sc1 or st1 for cost savings; volumes can also be resized or retyped after launch, as sketched below
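
As a sketch, resizing the root volume to 30 GB after launch could look like this (the volume ID is a placeholder; find yours under Volumes in the EC2 console):

aws ec2 modify-volume --volume-id <your-volume-id> --size 30 --volume-type gp3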

  • Now you just have to click on Launch instance! 🚀

Step 06: Login to the EC2 instance through SSH

  • Once you click on Launch, you will be taken to the Instances page
  • Click on Connect to instance to see the instructions

  • Click on SSH client to see the command to connect to your instance
  • Open a terminal on your local machine and navigate to the directory where you saved the key pair
  • Run chmod 400 "name-of-your-key.pem" to restrict the key's permissions, then run the example command AWS provides to connect to your instance. For example:
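
A minimal sketch of the two commands, assuming your key is qtm350.pem (the address is a placeholder; use the public DNS shown on your Connect page):

chmod 400 qtm350.pem
ssh -i qtm350.pem ubuntu@ec2-XX-XX-XX-XX.compute-1.amazonaws.com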

And you are in! 🎉

Welcome to the cloud! 🌥️

Check the instance details on the AWS Console

  • You can click on the instance ID or on the Instances link on the left to see the details of your instance
  • You can see the public IP address of your instance, the instance type, the security group, the key pair, and other details
  • You can also stop, terminate, or reboot your instance from this page (we will see how to do this later)

Using CloudShell

  • If you don’t have access to a terminal, you can use CloudShell
  • CloudShell is a browser-based shell that you can use to run commands on your instances
  • You can find it on the AWS Console, on the bottom left corner
  • It is a great way to run commands on your instances without having to install anything on your local machine
  • You can also use it to run commands on your S3 buckets, RDS databases, and other AWS services
  • I recommend using your local terminal as you can write code in your local IDE and run it on the cloud
  • But it is nice that AWS provides this option

Step 07: Update and install software

  • Once you are in, the first thing you should do is to update the package list
  • Run sudo apt update to refresh the package list
  • Then, type sudo apt upgrade to install the latest updates
    • You can use the -y flag to automatically answer yes to all prompts
  • You can also install software using sudo apt install
  • For example, to install Python 3, run sudo apt install python3
  • To install pip, run sudo apt install python3-pip
  • You can also install Jupyter or any other software you need
  • And we will be ready to go! 🚀

sudo apt update && sudo apt upgrade -y && sudo apt install -y python3 python3-pip

Adding files to your instance

  • You can add files to your instance using scp (secure copy)
  • scp is a command-line tool that allows you to copy files securely
  • To copy a file from your local machine to your instance, run scp -i "name-of-your-key.pem" file-to-copy ubuntu@public-ip:/path/to/destination
  • To copy a file from your instance to your local machine, run scp -i "name-of-your-key.pem" ubuntu@public-ip:/path/to/file-to-copy /path/to/destination
  • You can also use scp to copy files between two remote servers
  • Windows users can use WinSCP or PuTTY
  • Open a new terminal (not the one you are using to connect to your instance) and type the following
echo "print('Hello, QTM350!')" > hello.py
scp -i qtm350.pem hello.py ubuntu@XXXXXX.compute-1.amazonaws.com:~ 

  • Don’t forget to replace XXXXXX with your instance’s public DNS (or IP address); ~ already points to the ubuntu user’s home directory
  • You can then list (ls) and run (python3 hello.py) the file on your instance

Stopping and terminating your instance

  • It is very important to stop or terminate your instance when you are not using it
  • Stopping an instance will pause it and you will not be charged for it (but you will be charged for the storage)
  • Terminating an instance will delete it and you will not be charged for it (but you will lose all data)
  • You can stop or terminate your instance from the Instances page on the AWS Console
  • You can also reboot your instance if it is not responding
  • So let’s stop our instance now, either from the console or with the CLI sketched below! 🛑
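
The same operations are available from the AWS CLI. A sketch with a placeholder instance ID (find yours on the Instances page):

aws ec2 stop-instances --instance-ids <your-instance-id>        # pause; storage is still billed
aws ec2 terminate-instances --instance-ids <your-instance-id>   # delete permanently, data is lost
aws ec2 reboot-instances --instance-ids <your-instance-id>      # restart an unresponsive instance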

Now it is your turn! 🚀

Activity 01

  • Create an EC2 instance on AWS called jupyter
  • Connect to your instance using SSH
  • Update the package list, and install Python 3, pip, and Jupyter
    • sudo apt install python3-notebook
    • source ~/.profile
  • Test it with which python3, which pip3, and which jupyter
  • Open a new terminal and type this to forward your connection to port 8000
    • ssh -i "<your-key>.pem" ubuntu@<public_IPv4_DNS_address_of_your_EC2_instance> -L 8000:localhost:8888
  • Start your Jupyter notebook with jupyter notebook
  • Open your browser and go to http://localhost:8000
  • Copy and paste the token from the terminal
  • Create a new notebook and run print('Hello, QTM350!') (or any other code you like!)
  • Do not terminate your instance, we will use it later

Let’s see the code! 🚀

Activity 01

sudo apt update && sudo apt upgrade
sudo apt install python3 python3-pip python3-notebook
source ~/.profile
which python3
which pip3
which jupyter
ssh -i qtm350.pem ubuntu@XXXXX -L 8000:localhost:8888
jupyter notebook

Activity 01

Activity 01

Another one? 🚀

Activity 02

  • Now that you have your EC2 instance running, let’s do some data analysis
  • Let’s create a simple weather dataset, upload it to our EC2 instance, analyse it, and download the results
  • Install the required packages:
    • sudo apt install python3-numpy python3-pandas python3-matplotlib python3-seaborn
  • On your local machine, create a simple weather dataset with the Python code below, or download it here: weather_data.py
# weather_data.py
import pandas as pd
import numpy as np
import datetime

# Set seed for reproducibility
np.random.seed(42)

# Generate dates for the past 30 days
dates = pd.date_range(end=datetime.datetime.now(), periods=30).tolist()
dates = [d.strftime('%Y-%m-%d') for d in dates]

# Generate temperature data with some randomness
temp_high = np.random.normal(75, 8, 30)
temp_low = temp_high - np.random.uniform(10, 20, 30)
precipitation = np.random.exponential(0.5, 30)
humidity = np.random.normal(65, 10, 30)

# Create a structured dataset
weather_data = pd.DataFrame({
    'date': dates,
    'temp_high': temp_high,
    'temp_low': temp_low,
    'precipitation': precipitation,
    'humidity': humidity
})

# Save to a text file
with open('weather_data.txt', 'w') as f:
    f.write("# Weather data for the past 30 days\n")
    f.write(weather_data.to_string(index=False))
    
print("Weather data saved to weather_data.txt")

Activity 02

  • Run this script on your local machine: python3 weather_data.py
  • It will create a file called weather_data.txt with 30 days of weather data
  • Upload the dataset to your EC2 instance:
    • scp -i <your-key>.pem weather_data.txt ubuntu@<your-instance-ip>:~/
  • Create another Python script to analyse the data:
  • Either create a new file on your local machine or use the code: weather_analysis.py
# weather_analysis.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO  # read the data table from an in-memory string

# Read the weather data
with open('weather_data.txt', 'r') as f:
    lines = f.readlines()

# Skip the header comment
data_str = ''.join(lines[1:])
df = pd.read_csv(StringIO(data_str), sep=r'\s+')  # columns are separated by whitespace

# Print basic statistics
print("Weather Data Analysis:")
print("=====================")
print(f"Number of days: {len(df)}")
print(f"Average high temperature: {df['temp_high'].mean():.1f}°F")
print(f"Average low temperature: {df['temp_low'].mean():.1f}°F")
print(f"Maximum temperature: {df['temp_high'].max():.1f}°F on {df.loc[df['temp_high'].idxmax(), 'date']}")
print(f"Minimum temperature: {df['temp_low'].min():.1f}°F on {df.loc[df['temp_low'].idxmin(), 'date']}")
print(f"Days with precipitation > 1 inch: {len(df[df['precipitation'] > 1])}")

# Create a visualisation (use explicit axes so each label lands on the correct axis)
sns.set_style("whitegrid")
fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot temperature range
ax1.fill_between(df['date'], df['temp_low'], df['temp_high'], alpha=0.3, color='skyblue')
ax1.plot(df['date'], df['temp_high'], marker='o', color='red', label='High Temp')
ax1.plot(df['date'], df['temp_low'], marker='o', color='blue', label='Low Temp')

# Add precipitation as bars on a secondary axis
ax2 = ax1.twinx()
ax2.bar(df['date'], df['precipitation'], alpha=0.3, color='navy', width=0.5, label='Precipitation')
ax2.set_ylabel('Precipitation (inches)', color='navy')
ax2.tick_params(axis='y', labelcolor='navy')

# Formatting (applied to the temperature axis, not the precipitation axis)
ax1.set_title('30-Day Weather Report: Temperature Range and Precipitation', fontsize=16)
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.set_ylabel('Temperature (°F)')
ax1.legend(loc='upper left')
plt.tight_layout()

# Save the figure
plt.savefig('weather_analysis.png')
print("Analysis complete. Results saved to 'weather_analysis.png'")

Activity 02

  • Upload the files to your EC2 instance:
    • scp -i <your-key>.pem <file> ubuntu@<your-instance-ip>:~/
  • Create a Jupyter notebook session with port forwarding:
    • ssh -i <your-key>.pem ubuntu@<your-instance-ip> -L 8000:localhost:8888
    • jupyter notebook
  • Open Jupyter in your browser (http://localhost:8000) and run your analysis
  • Download the resulting image to your local machine:
    • From a new terminal on your local machine:
    • scp -i <your-key>.pem ubuntu@<your-instance-ip>:~/weather_analysis.png ./
  • View the image on your local machine
  • Don’t forget to terminate your instance when done!

Activity 02 Result

When you run the activity, you should get a weather analysis graph similar to this one:

Activity 02 Result

You’ve just completed a full data analysis workflow in the cloud 🎉

  1. Created data locally
  2. Uploaded it to the cloud
  3. Processed it on a cloud server
  4. Generated visualisations
  5. Downloaded results to your local machine

This workflow is similar to how data scientists use cloud resources for larger datasets and more complex analyses!

Conclusion 🎉

Conclusion

  • Today we learned how to launch an EC2 instance on AWS
  • We connected to our instance using SSH and installed software
  • We ran a Jupyter notebook on our instance and forwarded the connection to our local machine
  • We also uploaded a dataset, analysed it, and downloaded the results
  • We learned how to use scp to copy files between our local machine and our instance
  • And we learned how to stop and terminate our instance
  • I hope you enjoyed the class and learned something new today! 🚀
  • But we are just scratching the surface of what you can do with AWS
  • Take some time to explore the AWS Console and see what other services are available
    • Lambda is particularly easy and useful, and you can run code without provisioning or managing servers
  • You can also explore S3 for object storage, RDS for managed databases, and SageMaker for machine learning
  • And remember to always configure billing alarms and budget limits to avoid unexpected charges!

And that’s a wrap! 🎬

Thank you! 🙏