Lecture 7

Hadoop and Spark

Tyler Ransom

ECON 5253, University of Oklahoma

1 / 19

Plan for the day

  • How RDD systems (Hadoop, Spark, ...) work

  • How to run computing jobs on OSCER

2 / 19

What is Hadoop?

  • Hadoop is a framework for distributed storage and processing of data, created by the Apache Software Foundation

  • It allows for efficient processing of big data sets by distributing computations across a cluster of computers

  • It has 3 properties:

    • open source

    • scalable

    • fault tolerant

3 / 19

Hadoop framework

The Hadoop framework comprises the following elements:

  • Hadoop Common: the base set of libraries

  • Hadoop Distributed File System (HDFS): file system that stores data on many machines

  • Hadoop YARN: resource manager for Hadoop jobs ("YARN" stands for "Yet Another Resource Negotiator")

  • Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing

4 / 19

What is MapReduce?

  • "MapReduce" is a general programming model for distributing complex or intensive computations across multiple machines.

  • The main workflow of the MapReduce model is as follows (a toy Python sketch appears after the list):

  1. Split: divide your huge dataset into K chunks / distribute across machines

  2. Map: apply some function (in parallel) to each of the chunked-up datasets

    • e.g. count words, shuffle, sort, ...
  3. Reduce: reduce the number of chunks by combining them, either by merging the datasets back together or by performing some other aggregating operation (like addition or averaging)
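
To make the three steps concrete, here is a minimal single-machine sketch of the Split/Map/Reduce pattern in plain Python (the sample text and the choice of K = 4 chunks are made up for illustration; in a real MapReduce job the Map step would run on separate machines):

from collections import Counter
from functools import reduce

text = ("it is a truth universally acknowledged that a single man in "
        "possession of a good fortune must be in want of a wife")

# 1. Split: divide the dataset into K chunks
K = 4
words = text.split()
chunks = [words[i::K] for i in range(K)]

# 2. Map: apply a function (here, word counting) to each chunk in parallel
mapped = [Counter(chunk) for chunk in chunks]

# 3. Reduce: combine the per-chunk counts back into a single result
totals = reduce(lambda a, b: a + b, mapped)
print(totals.most_common(3))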

5 / 19

MapReduce visualization

[Figure: MapReduce visualization, part 1]

6 / 19

MapReduce visualization

[Figure: MapReduce visualization, part 2]

7 / 19

Other Apache products

In addition to the above four elements of the Hadoop framework, Apache has continued to create higher-level programs for a more tailored Big Data experience:

  • Pig: a high-level language for MapReduce tasks (think of it as a SQL-style scripting language for MapReduce)
  • Hive: a SQL-style interface to query data stored in the HDFS
  • Spark: an innovation on Hadoop MapReduce that relaxes some constraints on how MapReduce jobs must be structured

All of the Apache products work in the HDFS environment.

Trivia: Hadoop was created in 2006 following the original MapReduce programming model and the original Google File System, both of which were created by Google in the early 2000s.

8 / 19

What is Spark?

  • Spark was released in 2010 as a generalization of the MapReduce concept for HDFS

  • Essentially, the creators of Spark wanted to allow for more types of operations to fit into the MapReduce model

9 / 19

Spark's innovations over Hadoop

  • faster
  • consumes fewer resources
  • includes APIs to interface with common programming languages
  • simplified user experience
  • interface with Machine Learning libraries
  • handling of streamed data (e.g. Twitter or Facebook timelines, which are continuously generated)
  • includes an interactive prompt (in Scala)

Spark does all this while also satisfying the primary goals of Hadoop (open source, scalable, and fault tolerant)

10 / 19

Spark APIs

Spark's API has been developed for the following languages:

  • R (SparkR, sparklyr)
  • Python (PySpark)
  • Julia (Spark.jl)
  • SQL (spark-sql)
  • Java
  • Scala

Many of these are included on the OSCER computing cluster and can be run interactively

11 / 19

RDDs

  • RDDs (Resilient Distributed Datasets) are the fundamental units of data in Spark

  • RDDs can be created in three ways (each is sketched in PySpark below):

  1. By creating a parallelized collection

  2. By converting a CSV, JSON, TSV, etc. to an RDD

  3. By applying a transformation operation to an existing RDD
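
A minimal PySpark sketch of all three ways, assuming a working Spark installation (the HDFS path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# 1. Create a parallelized collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# 2. Convert an external file (CSV, JSON, TSV, ...) to an RDD
#    (the path below is hypothetical)
rdd2 = sc.textFile("hdfs:///data/example.csv")

# 3. Apply a transformation to an existing RDD
rdd3 = rdd1.map(lambda x: x ** 2)

print(rdd3.collect())   # [1, 4, 9, 16, 25]
spark.stop()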

12 / 19

Immutability of RDDs

  • RDDs are immutable!

  • This means they cannot be modified

  • This turns out to be a highly efficient way of organizing computing operations, both in terms of minimizing computational burden and in terms of maintaining fault tolerance

13 / 19

Usefulness of immutability?

  • If RDDs can't be modified, how can we apply functions to them?

  • There are two (non-mutually-exclusive) ways, sketched in PySpark after this list:

  1. Transformation: create a new RDD which is a copy of the previous RDD, but with the new operation applied
    • e.g. map
    • e.g. returning unique values of the source dataset
    • e.g. sorting by a key
  2. Action: return some final result (either to the node we originated from, or to an external file)
    • e.g. reduce
    • e.g. returning the first n elements of the dataset
    • e.g. returning a random sample of the dataset
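
A minimal PySpark sketch of the distinction (the sample data are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransformVsAction").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([3, 1, 2, 3, 1])

# Transformations: each returns a *new* RDD; rdd itself is never modified
squared = rdd.map(lambda x: x * x)               # map
uniques = rdd.distinct()                         # unique values of the source dataset
by_key  = rdd.map(lambda x: (x, 1)).sortByKey()  # sorting by a key

# Actions: each returns a final result to the driver
total  = rdd.reduce(lambda a, b: a + b)  # reduce (here: addition) -> 10
first3 = rdd.take(3)                     # first n elements -> [3, 1, 2]
sample = rdd.takeSample(False, 2)        # random sample, without replacement

print(total, first3, sample)
spark.stop()
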
14 / 19

Hadoop Example: Pride & Prejudice

  • We'll walk through an example on OSCER that uses Java to call Hadoop, counting instances of every word in Pride & Prejudice across 20 nodes and writing the result to a text file

  • See the in-class demonstration for more details; a rough Python analogue in the Hadoop Streaming style is sketched below
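
The demo itself is in Java, but the same idea can be sketched in Python in the style of Hadoop Streaming, which lets any executable serve as the Map and Reduce steps (the file names and pipeline below are illustrative, not the demo's actual code):

# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys, re

for line in sys.stdin:
    for word in re.findall(r"[a-z']+", line.lower()):
        print(word + "\t1")

# reducer.py -- sum the counts for each word (Hadoop sorts by key in between)
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

Locally, the pair can be tested with a shell pipeline that mimics what Hadoop does across nodes: pipe the book's text file into python3 mapper.py, then through sort, then into python3 reducer.py.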

15 / 19

Spark Example: Computing π

  • We can estimate π by drawing pairs of random numbers between 0 and 1 and counting what fraction of them fall inside the unit circle, i.e. satisfy x^2 + y^2 ≤ 1

  • Then, because the area of a circle is π·radius^2 and the circle x^2 + y^2 = 1 has radius 1, the quarter circle in the 1st Quadrant (where our random numbers fall) has area π/4; multiplying the fraction we computed by 4 therefore gives an estimate of π
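
In symbols: if m of the n draws satisfy x^2 + y^2 ≤ 1, then m/n ≈ π/4, and so

π ≈ 4 · (m / n)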

16 / 19

Computing π: visualization

[Figure: Monte Carlo simulation for computing π]

17 / 19

PySpark code

from __future__ import print_function
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    # Start (or reuse) a Spark session
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    # Number of chunks to split the job into (default: 2)
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Draw a point uniformly on [-1, 1] x [-1, 1];
    # return 1 if it lands inside the unit circle, 0 otherwise
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # Map f over 1..n in parallel, then Reduce by summing the hits
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions)\
        .map(f)\
        .reduce(add)

    # hits / n approximates pi / 4, so multiply by 4
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
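
Note that this script draws points on the whole square [-1, 1] × [-1, 1] rather than only the 1st Quadrant; the share landing inside the circle is still π/4, so the same multiply-by-4 logic applies. On a cluster, a script like this is typically launched with Spark's spark-submit command, passing the number of partitions as an argument (e.g. spark-submit pi.py 10, where the script name is illustrative); the exact module setup on OSCER may differ.
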
18 / 19
