QTM 350 - Data Science Computing

Lecture 18 - Introduction to Cloud Computing

Danilo Freire

Emory University

Hello, everyone! 👋

Cloud computing ☁️

What is cloud computing?

  • Cloud computing is the on-demand delivery of computing resources through a cloud services platform via the internet with pay-as-you-go pricing
  • Cloud computing has increased in popularity in recent years because it offers many advantages
    • Scalability, flexibility, and cost-efficiency
  • There are three main types of cloud computing services: IaaS, PaaS, and SaaS
  • The most popular cloud computing platforms are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), the “Big Three”
  • They are so widely used that Amazon’s AWS is the most profitable division of the company
    • Although AWS accounts for only 15% of Amazon’s total sales, it generates more than 50% of the company’s profits (~28 bi)

Source: GeekWire

Why should we care about cloud computing? 🤔

Imagine you are opening a business…

Traditional approach

  1. Estimate supply and demand
  2. Estimate infrastructural needs
  3. Purchase and deploy infrastructure
  4. Install and test your system
  5. Offer your services to clients
  • Infrastructure is very expensive
  • It takes time to purchase and deploy the infrastructure
  • What if the estimations were wrong?

Imagine you are opening a business…

Cloud approach

  1. Choose one or more cloud services providers
  2. Deploy your systems on the cloud
  3. Offer your services to clients
  4. Pay for what you use

Traditional process       | With cloud computing
❌ High investment risk   | ✅ Reduced risk
❌ Long time-to-market    | ✅ Shorter time-to-market
✅ Manages own data       | ❌ Trust the vendor?
✅ Completely in control  | ❌ Dependent on a specific vendor?

Case study: Animoto

  • Animoto: Lets users create videos from their own photos/music
  • Auto-edits photos and aligns them with the music, so it “looks good”
  • Built using Amazon EC2+S3+SQS
  • Released a Facebook app in mid-April 2008
  • More than 750,000 people signed up within 3 days
  • EC2 usage went from 50 machines to 3,500 (x70 scalability!)
  • No way they could have done this with traditional infrastructure!

Source: Jeff Bezos’ talk at Stanford in 2008

Case study: The Washington Post

  • Hillary Clinton’s official White House schedule released to the public
  • 17,481 pages of non-searchable, low-quality PDF
  • Very interesting to journalists, but would have required hundreds of man-hours to evaluate
  • Peter Harkins, Senior Engineer at The Washington Post: “Can we make that data available more quickly, ideally within the same news cycle?”
  • Tested various Optical Character Recognition (OCR) programs; estimated required speed
  • Launched 200 EC2 instances; project was completed within nine hours using 1,407 hours of VM time ($144.62)
  • Results available on the web only 26 hours after the release
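
The figures in this case study imply a few back-of-the-envelope numbers. A minimal sketch of that arithmetic, using only the values quoted above (the variable names are illustrative, and the single-machine estimate assumes one machine as fast as one EC2 instance):

```python
# Figures quoted in the case study above.
instances = 200          # EC2 instances launched
vm_hours = 1407          # total hours of VM time billed
total_cost = 144.62      # total cost in dollars

# Average price paid per hour of VM time.
cost_per_vm_hour = total_cost / vm_hours

# Average time each instance was running.
avg_hours_per_instance = vm_hours / instances

# How long the same workload would take on one machine running non-stop.
single_machine_days = vm_hours / 24

print(f"Cost per VM-hour: ${cost_per_vm_hour:.4f}")
print(f"Average hours per instance: {avg_hours_per_instance:.2f}")
print(f"Days on a single machine: {single_machine_days:.1f}")
```

In other words, roughly ten cents per VM-hour and about seven hours per instance, against almost two months of non-stop work on a single machine.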

Cloud computing services ☁️

Cloud computing services

  • SaaS (Software as a Service):
    • Software is hosted on a cloud and accessed via the internet as a complete application
    • You don’t have to worry about updates, security, or maintenance
    • You (usually) pay a subscription fee (monthly or yearly)
    • Examples: Gmail, Office 365, Salesforce
  • PaaS (Platform as a Service):
    • Provides a platform allowing customers to develop, run, and manage applications without the complexity of building and maintaining the infrastructure
    • Usually, the platform offers a series of APIs
    • You pay for the platform and the resources you use (e.g., storage, bandwidth)
    • Examples: Google App Engine, AWS Elastic Beanstalk, Heroku

Cloud computing services

SaaS

Cloud computing services

PaaS

Cloud computing services

FaaS

Cloud computing services

IaaS

Comparison of cloud services

Layer          | On-premise | IaaS     | PaaS     | SaaS
Application    | User       | User     | User     | Provider
Middleware     | User       | User     | Provider | Provider
OS             | User       | User     | Provider | Provider
Virtualisation | User       | Provider | Provider | Provider
Servers        | User       | Provider | Provider | Provider
Networking     | User       | Provider | Provider | Provider

User = user manages
Provider = provider manages

What is virtualisation?

  • Suppose Alice has a machine with 4 CPUs and 8 GB of memory, and three customers:
    • Bob wants a machine with 1 CPU and 3GB of memory
    • Charlie wants 2 CPUs and 1GB of memory
    • Daniel wants 1 CPU and 4GB of memory
  • What should Alice do?
  • Alice can sell each customer a virtual machine (VM) with the requested resources
  • From each customer’s perspective, it appears as if they had a physical machine all by themselves (isolation)
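
Alice's problem can be checked with a few lines of code. The sketch below only verifies that the requested VMs fit on the physical machine; the `Resources` class and the `fits` function are hypothetical names made up for this example, and a real hypervisor does far more than add up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_gb: int


def fits(host, requests):
    """Return True if all requested VMs fit on the host at the same time."""
    total_cpus = sum(r.cpus for r in requests.values())
    total_memory = sum(r.memory_gb for r in requests.values())
    return total_cpus <= host.cpus and total_memory <= host.memory_gb


# Alice's physical machine and her three customers, as in the example above.
host = Resources(cpus=4, memory_gb=8)
requests = {
    "Bob": Resources(cpus=1, memory_gb=3),
    "Charlie": Resources(cpus=2, memory_gb=1),
    "Daniel": Resources(cpus=1, memory_gb=4),
}

print(fits(host, requests))
```

The three requests add up to exactly 4 CPUs and 8 GB, so one physical machine can serve all three customers with nothing left over.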

More about virtualisation

What is middleware?

  • Middleware is software that connects two separate applications
  • This is software we rarely see, but which is essential for the functioning of the internet
  • For instance, software that handles authentication, authorisation, and encryption, drivers that connect to databases, and software that handles messaging
  • Amazon SQS, Apache Kafka, and RabbitMQ are examples of middleware
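
To get a feel for what messaging middleware does, here is a minimal sketch that uses Python's standard `queue` module as an in-process stand-in for a message broker. This is an assumption made for illustration: services such as Amazon SQS or RabbitMQ run as separate systems over the network, and the function names below are hypothetical. The idea is the same, though: the producer and the consumer never call each other directly, they only talk to the queue.

```python
import queue
import threading


def producer(q, orders):
    """Application A: puts messages on the queue and moves on."""
    for order in orders:
        q.put(order)
    q.put(None)  # sentinel telling the consumer there is nothing left


def consumer(q, processed):
    """Application B: takes messages off the queue at its own pace."""
    while True:
        message = q.get()
        if message is None:
            break
        processed.append(f"processed {message}")


def run_demo(orders):
    q = queue.Queue()
    processed = []
    worker = threading.Thread(target=consumer, args=(q, processed))
    worker.start()
    producer(q, orders)
    worker.join()
    return processed


print(run_demo(["order-1", "order-2", "order-3"]))
```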

The Big Three ☁️

Amazon Web Services (AWS)

  • Amazon Web Services is a collection of cloud-based services
  • It’s a very big one
  • Let me say it again: a VERY big one 😂
  • They offer a wide range of services:
    • Compute, storage, databases, analytics, machine learning, AI, IoT, security, and more
  • Many companies use AWS, including Netflix, Airbnb, and NASA
  • The most widely used services are EC2, S3, and RDS, for computing, storage, and databases, respectively
  • Let’s look at them in more detail

AWS services 🛠️

Amazon EC2 - Elastic Compute Cloud

  • Amazon EC2 is a web service that provides resizable compute capacity in the cloud
  • It’s designed to make web-scale cloud computing easier for developers
  • You can launch instances with a variety of operating systems (mainly Linux)
  • This means that, for the most part, you can use bash and run any software you want from the command line
  • You can also use a GUI, but it is not necessary
  • You can use EC2 to host a website, run a database, or run a machine learning model
  • However, please be mindful of the costs!
  • You pay for the instances you use, and the costs can add up quickly
  • Per-second (or per-hour) billing
  • Data transfer not included!
  • Persistent storage not included!
  • Scaling not included!
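
To see how quickly compute costs add up, here is a rough estimate under per-second billing. The hourly rate below is made up for illustration, and the 60-second minimum charge is an assumption about how per-second billing is applied; check the current EC2 pricing page for real numbers. As the list above says, data transfer and storage are billed separately and are not included here.

```python
def ec2_cost(seconds_running, hourly_rate, minimum_seconds=60):
    """Estimate the compute charge for one instance under per-second billing.

    minimum_seconds is an assumed minimum charge per launch.
    """
    billed_seconds = max(seconds_running, minimum_seconds)
    return billed_seconds * hourly_rate / 3600


HYPOTHETICAL_RATE = 0.10  # dollars per hour, made up for illustration

# One instance used for a 90-minute lab session.
lab_session = ec2_cost(90 * 60, HYPOTHETICAL_RATE)

# Ten instances left running by mistake for 30 days.
forgotten = 10 * ec2_cost(30 * 24 * 3600, HYPOTHETICAL_RATE)

print(round(lab_session, 2))
print(round(forgotten, 2))
```

With this hypothetical rate, the lab session costs about 15 cents, while the ten forgotten instances cost about 720 dollars.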

Cloud storage

Amazon S3 - Simple Storage Service

  • Amazon has three main storage services: S3, EBS, and EFS
  • S3 is a web service that provides object storage through a web interface
  • You can store data in buckets, which are like folders
  • Data are distributed across a minimum of three availability zones
    • 99.999999999% durability (eleven nines!)
  • You can set up lifecycle policies to automatically move data to cheaper storage classes
    • Standard, Intelligent-Tiering, Glacier, Glacier Deep Archive
  • You can also set up versioning to keep multiple versions of an object (works like git)
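
A lifecycle policy is just a small configuration document. The sketch below builds one as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` expects. The function name, rule ID, prefix, and day counts are hypothetical choices for this example, and `boto3` itself is not needed to run it.

```python
def build_lifecycle_config(prefix, days_to_glacier, days_to_deep_archive):
    """Build a rule that moves objects to cheaper storage classes as they age."""
    if days_to_deep_archive <= days_to_glacier:
        raise ValueError("Deep Archive transition must come after Glacier")
    return {
        "Rules": [
            {
                "ID": "archive-old-objects",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                    {"Days": days_to_deep_archive, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_config("recordings/", 90, 365)
print(config)

# Applying it would look like this, assuming boto3 is installed,
# credentials are configured, and the bucket exists:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-hypothetical-bucket", LifecycleConfiguration=config
#   )
```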

Cloud storage

Amazon EBS - Elastic Block Store

  • EBS is a web service that provides block storage volumes for use with EC2 instances
  • The difference between S3 and EBS is that EBS is block storage, while S3 is object storage
    • Block storage is like a hard drive, while object storage is like a file system
  • Why would you use it? Mainly for SQL files
    • Low-latency, high-performance storage for frequent access
  • You can attach EBS volumes to your EC2 instances
  • You can also take snapshots of your volumes for backup (like git again!)
  • Up to 16 TB per volume

Database services

Amazon RDS - Relational Database Service

  • Amazon RDS is a web service that makes it easy to set up, operate, and scale a relational database
  • You can choose from six popular database engines:
    • Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server
  • You can also use Aurora Serverless to automatically scale your database
  • It’s a managed service, so you don’t have to worry about backups, patches, or updates
  • It works well with another AWS service called Athena, which allows you to query data in S3 using SQL
  • You can also use Redshift for data warehousing, and DynamoDB for NoSQL databases

Amazon SageMaker

  • Finally, let’s talk about Amazon SageMaker
  • SageMaker is a fully managed service that allows you to build, train, and deploy machine learning models at scale
  • It supports several pre-configured frameworks, such as TensorFlow, PyTorch, and Scikit-learn
  • You can also bring your own models and use Jupyter notebooks to train them
  • Many other models are available in the AWS Marketplace, and you can use SageMaker Studio to manage your projects
  • Machine Learning: application services
    • Comprehend (for NLP)
    • Rekognition (Visual Analysis)
    • Translate
    • Polly (text-to-speech)

Creating and managing an AWS account 🛠️

Creating an AWS account

You can do this later if you want

  • To create an AWS account, go to https://aws.amazon.com
  • Enter your account information, and then choose Verify email address
  • You will receive an email with a verification link
  • Select Personal and choose Continue
  • Enter your billing information (yes, you need to 😒)
  • Enter the code displayed in the CAPTCHA, and then submit
  • Choose Complete sign up
  • And you’re done 🎉

Managing an AWS account

  • The first thing you should do is set up a billing alarm
    • You can set up a billing alarm to notify you when your bill exceeds a certain amount
    • This is very important, as costs can add up quickly
  • Go to https://console.aws.amazon.com/costmanagement/
  • In the navigation pane, choose Billing Preferences (scroll down to the bottom)
  • Next to Alert preferences, choose Edit
  • Choose Receive AWS Free Tier Alerts
  • Choose Save preferences
  • Then go to https://us-east-1.console.aws.amazon.com/billing/home#/budgets/overview
  • Choose Create budget, choose a Zero spend budget template, add your email, then click Create budget

Audio transcription and sentiment analysis on AWS

Audio transcription on AWS

  • Imagine that you have a task similar to that of the Washington Post journalists
  • You found a series of audio files that you need to transcribe, and you have to do it quickly
  • So let’s create an S3 bucket (folder), upload an audio file, transcribe it, download the subtitles, and then analyse the sentiment of the text
  • We will use Amazon Transcribe and Amazon Comprehend to do this
  • The file we will use is available here: JFK - We Choose to Go to the Moon
  • Select AWS Management Console to open the console, then search for S3 in the search bar
  • Click on Create bucket, give it a name, accept the permissions, and click Create bucket

Audio transcription at scale

S3 bucket

Then click on the bucket and upload the file
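
The same two steps can also be scripted. This is a minimal sketch, assuming the `s3` argument behaves like `boto3.client("s3")`; the function name, bucket name, and file names are hypothetical.

```python
def create_bucket_and_upload(s3, bucket_name, local_path, key):
    """Create a bucket and upload one local file to it.

    `s3` is assumed to behave like boto3.client("s3").
    """
    # Without extra configuration, create_bucket targets us-east-1;
    # other regions need a CreateBucketConfiguration argument.
    s3.create_bucket(Bucket=bucket_name)
    s3.upload_file(local_path, bucket_name, key)
    return f"s3://{bucket_name}/{key}"


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   uri = create_bucket_and_upload(
#       boto3.client("s3"), "my-hypothetical-bucket", "speech.mp3", "speech.mp3"
#   )
```

The function returns the `s3://` address of the uploaded file, which is what a transcription job needs as input.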

Create transcription job

  • Now let’s go to Amazon Transcribe
  • Just search for Transcribe in the search bar
  • Click on Create job
  • Give it a name, select the language, and choose the S3 bucket where the file is located
  • Click Create
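
For many files, clicking through the console gets tedious. Here is a minimal sketch of the same job in code, assuming the `transcribe` argument behaves like `boto3.client("transcribe")`. The function names, polling interval, and polling limit are hypothetical choices for this example.

```python
import time


def start_job(transcribe, job_name, media_uri, language_code="en-US"):
    """Start a transcription job for an audio file stored in S3."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri},
        LanguageCode=language_code,
    )


def wait_for_job(transcribe, job_name, poll_seconds=10, max_polls=60):
    """Poll until the job finishes and return the transcript file address."""
    for _ in range(max_polls):
        response = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_name} did not finish in time")


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("transcribe")
#   start_job(client, "jfk-demo", "s3://my-hypothetical-bucket/speech.mp3")
#   print(wait_for_job(client, "jfk-demo"))
```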

Transcription job

Download the transcription

  • Now you can download the transcription file in JSON format
  • Click on the Download button, and save the file
  • Now open the file in VS Code or any text editor
  • You will see the transcription in JSON format, with the text and the timestamps
  • We only need the text, so let’s extract it
  • Find the results key, and then the transcripts key
  • The text is in the transcript key, so copy it
  • We will then go to Amazon Comprehend and analyse it
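
Instead of copying the text by hand, you can extract it with a few lines of Python. This is a minimal sketch that follows the keys described above (`results`, then `transcripts`, then `transcript`); the sample JSON is a shortened, made-up stand-in for a real output file, and the function name is hypothetical.

```python
import json


def extract_transcript(json_text):
    """Return the plain transcript text from a Transcribe output file."""
    data = json.loads(json_text)
    # "transcripts" is a list; this sketch takes its first entry.
    return data["results"]["transcripts"][0]["transcript"]


# Shortened, made-up example of what the downloaded file looks like.
sample = """
{
  "jobName": "demo",
  "results": {
    "transcripts": [{"transcript": "We choose to go to the moon."}],
    "items": []
  },
  "status": "COMPLETED"
}
"""

print(extract_transcript(sample))

# With a real file (hypothetical file name):
#   with open("asrOutput.json", encoding="utf-8") as f:
#       text = extract_transcript(f.read())
```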

Analyse the transcription

  • Now let’s go to Amazon Comprehend
  • Search for Comprehend in the search bar
  • Click on Launch Amazon Comprehend
  • Scroll down and paste the text in the Input data box
  • Click on Analyze
  • And that’s it! You will see the results in the Insight box!
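
The same analysis can be run in code. This is a minimal sketch, assuming the `comprehend` argument behaves like `boto3.client("comprehend")`. Sentiment requests have a size limit per call; the 5,000-byte default below is an assumption you should check against the current documentation. Longer texts, such as a full speech, are therefore split into chunks first. The function names are hypothetical.

```python
def split_into_chunks(text, max_bytes=5000):
    """Split text on whitespace into pieces of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept as its own oversized chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def analyse_sentiment(comprehend, text, language_code="en"):
    """Return one (sentiment, scores) pair per chunk of text."""
    results = []
    for chunk in split_into_chunks(text):
        response = comprehend.detect_sentiment(Text=chunk, LanguageCode=language_code)
        results.append((response["Sentiment"], response["SentimentScore"]))
    return results


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(analyse_sentiment(boto3.client("comprehend"), "We choose to go to the moon."))
```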

Delete the transcription job and S3 bucket

  • Now that we are done, we can delete the transcription job and the S3 bucket
  • Go to Amazon Transcribe again and select the transcription job
  • Click on Delete job and confirm
  • Then go to Amazon S3 and select the bucket
  • First you need to delete the objects inside the bucket. Just select the file and click on Delete
  • Type “permanently delete” to confirm
  • Then select the bucket, click on Delete, type the bucket name to confirm, and click on Delete bucket
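
The clean-up can be scripted too, which makes it harder to forget. This is a minimal sketch, assuming the two client arguments behave like `boto3.client("transcribe")` and `boto3.client("s3")`. It ignores pagination (a single listing call returns a limited number of objects) and versioned buckets, so treat it as an illustration for a small demo bucket only. The function name is hypothetical.

```python
def clean_up(transcribe, s3, job_name, bucket_name):
    """Delete the transcription job, then empty and delete the bucket."""
    transcribe.delete_transcription_job(TranscriptionJobName=job_name)

    # A bucket must be empty before it can be deleted.
    listing = s3.list_objects_v2(Bucket=bucket_name)
    for obj in listing.get("Contents", []):
        s3.delete_object(Bucket=bucket_name, Key=obj["Key"])

    s3.delete_bucket(Bucket=bucket_name)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   clean_up(boto3.client("transcribe"), boto3.client("s3"),
#            "jfk-demo", "my-hypothetical-bucket")
```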

Conclusion 🎉

What we learned today

  • Why cloud computing is important and how it can help your business
  • The Big Three cloud computing platforms: AWS, Azure, and GCP
  • The main types of cloud computing services: IaaS, PaaS, SaaS, FaaS
  • The most popular AWS services: EC2, S3, RDS, SageMaker
  • How to create an AWS account and set up a billing alarm
  • How to create an S3 bucket, upload an audio file, and transcribe it with Amazon Transcribe
  • And that’s just the beginning!
  • In the next sessions, we will learn more about AWS services, and how to use them to build and deploy applications
  • Remember to stop your EC2 instances and delete your S3 buckets when you’re done!

And that’s a wrap! 🎬

Thank you very much!
See you next time! 😊🙏🏽