class: center, middle, inverse, title-slide

# Lecture 7

## Hadoop and Spark

### Tyler Ransom

### ECON 5253, University of Oklahoma

---

# Plan for the day

- How RDD systems (Hadoop, Spark, ...) work

- How to run computing jobs on OSCER

---

# What is Hadoop?

- Hadoop is a .hi[distributed file system] and computing framework created by the Apache Software Foundation

- It allows for efficient processing of big data sets by distributing data storage and processing computations across a cluster of computers

- It has 3 properties:

    - open source

    - scalable

    - fault tolerant

---

# Hadoop framework

The Hadoop framework comprises the following elements:

- .hi[Hadoop Common:] the base set of libraries

- .hi[Hadoop Distributed File System (HDFS):] a file system that stores data across many machines

- .hi[Hadoop YARN:] the resource manager for Hadoop jobs ("YARN" stands for "Yet Another Resource Negotiator")

- .hi[Hadoop MapReduce:] an implementation of the MapReduce programming model for large-scale data processing

---

# What is MapReduce?

- "MapReduce" is a general programming model for distributing complex or intensive computations across multiple machines

- The main workflow of the MapReduce model is:

    1. .hi[Split:] divide your huge dataset into K chunks and distribute the chunks across machines

    2. .hi[Map:] apply some function (in parallel) to each of the chunked-up datasets

        * e.g. count words, shuffle, sort, ...

    3. .hi[Reduce:] reduce the number of chunks in the dataset by combining them, either by merging the datasets back together or by performing some other aggregating operation (like addition or averaging)

- A minimal word-count sketch of this workflow appears in the appendix

---

# MapReduce visualization

.center[

]

---

# Other Apache products

In addition to the above four elements of the Hadoop framework, Apache has continued to create higher-level programs for a more tailored Big Data experience:

- .hi[Pig:] a high-level scripting language for writing MapReduce tasks (think of it as a shell-scripting-style language for MapReduce)

- .hi[Hive:] a SQL-style interface to query data stored in the HDFS

- .hi[Spark:] an innovation on Hadoop MapReduce that relaxes some constraints on how MapReduce jobs must be structured

All of the Apache products work in the HDFS environment.

.hi[Trivia:] Hadoop was created in 2006 following the original MapReduce programming model and the original Google File System, both of which were created by Google in the early 2000s.

---

# What is Spark?

- Spark was released in 2010 as a generalization of the MapReduce concept for HDFS

- Essentially, the creators of Spark wanted to allow for more types of operations to fit into the MapReduce model

---

# Spark's innovations over Hadoop

* faster

* consumes fewer resources

* includes APIs to interface with common programming languages

* simplified user experience

* interfaces with machine learning libraries

* handles streamed data (e.g. Twitter or Facebook timelines, which are continuously generated)

* includes an interactive prompt (in Scala)

Spark does all this while also satisfying the primary goals of Hadoop (being open source, scalable, and fault tolerant)

---

# Spark APIs

Spark's API has been developed for the following languages:

* R (SparkR, sparklyr)

* Python (PySpark)

* Julia (Spark.jl)

* SQL (spark-sql)

* Java

* Scala

Many of these are included on the OSCER computing cluster and can be run interactively

---

# RDDs

- RDDs (Resilient Distributed Datasets) are the fundamental units of data in Spark

- RDDs can be created in three ways (a PySpark sketch of each follows on the next slide):

    1. By creating a parallelized collection

    2. By converting an external file (CSV, JSON, TSV, etc.) to an RDD

    3. By applying a transformation operation to an existing RDD
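---

# Creating RDDs: PySpark sketch

A minimal sketch of the three creation routes, assuming a running Spark installation; the file name `data.csv` is a hypothetical placeholder, not a file from class.

```python
from pyspark.sql import SparkSession

# Start (or retrieve) a Spark session and grab its context
spark = SparkSession.builder.appName("CreateRDDs").getOrCreate()
sc = spark.sparkContext

# 1. Create a parallelized collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])

# 2. Convert an external text-based file (CSV, TSV, etc.) to an RDD
#    ("data.csv" is a made-up path for illustration)
rdd2 = sc.textFile("data.csv")

# 3. Apply a transformation operation to an existing RDD
rdd3 = rdd1.map(lambda x: x * 2)

spark.stop()
```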
---

# Immutability of RDDs

- RDDs are .hi[immutable]!

- This means they cannot be modified once created

- This turns out to be a highly efficient way of organizing computing operations, both in terms of minimizing computational burden and in terms of maintaining fault tolerance

---

# Usefulness of immutability?

- If RDDs can't be modified, how can we apply functions to them?

- There are two (non-mutually-exclusive) ways (see the PySpark sketch in the appendix):

    1. .hi[Transformation:] create a new RDD which is a copy of the previous RDD, but with the new operation applied

        * e.g. `map`

        * e.g. returning unique values of the source dataset

        * e.g. sorting by a key

    2. .hi[Action:] return some final result (either to the node we originated from, or to an external file)

        * e.g. `reduce`

        * e.g. returning the first `n` elements of the dataset

        * e.g. returning a random sample of the dataset

---

# Hadoop Example: *Pride & Prejudice*

* We'll walk through an example on OSCER that uses Java to call Hadoop and count instances of every word in *Pride & Prejudice* using 20 nodes, writing the result to a text file

* See the in-class demonstration for more details

---

# Spark Example: Computing `\(\pi\)`

- We can estimate `\(\pi\)` by drawing pairs of random numbers between 0 and 1 and counting what fraction of them fall inside the unit circle `x^2 + y^2 = 1`

- Because the area of a circle is `pi*radius^2` and the circle `x^2 + y^2 = 1` has radius 1, that fraction estimates the area of a quarter circle, `pi/4` (our random numbers only fall in the 1st Quadrant), so we multiply the fraction by 4 to get an estimate of `\(\pi\)`

---

# Computing `\(\pi\)`: visualization

.center[

]

---

# PySpark code

.scroll-box-18[

```python
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    # Start (or retrieve) a Spark session
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    # Number of chunks to split the job into (default: 2)
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Draw a random point in the square [-1, 1] x [-1, 1];
    # return 1 if it lands inside the unit circle, 0 otherwise
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # Map: evaluate f on n draws in parallel; Reduce: sum the indicators
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

    # The fraction inside the circle estimates pi/4, so multiply by 4
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
```

]

---

# Useful links

* [Quora: Hadoop vs. Spark](https://www.quora.com/What-is-the-difference-between-Hadoop-and-Spark)

* [Spark basics](https://data-flair.training/blogs/what-is-spark/)

* [Spark workshop slides](https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf)

* [Hadoop wikipedia page](https://en.wikipedia.org/wiki/Apache_Hadoop)

* [MapReduce wikipedia page](https://en.wikipedia.org/wiki/MapReduce)

* [SparkR demo](https://rpubs.com/wendyu/sparkr)
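---

# Appendix: MapReduce word count sketch

To make the split/map/reduce workflow concrete, here is a minimal sketch in plain Python rather than the Java code from the in-class demo; the file name `pride_and_prejudice.txt` and the choice of `K = 20` chunks are hypothetical stand-ins. On a real cluster, Hadoop performs the split across machines and runs the map step in parallel.

```python
from collections import Counter
from functools import reduce

# Hypothetical input file standing in for Pride & Prejudice on HDFS
words = open("pride_and_prejudice.txt").read().lower().split()

# Split: divide the dataset into K chunks
K = 20
chunks = [words[i::K] for i in range(K)]

# Map: count the words within each chunk (in parallel on a real cluster)
mapped = [Counter(chunk) for chunk in chunks]

# Reduce: merge the per-chunk counts into one overall count
total = reduce(lambda a, b: a + b, mapped)

print(total.most_common(10))
```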
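---

# Appendix: transformations vs. actions in PySpark

A minimal sketch of the transformation/action distinction discussed earlier, using a made-up toy collection; the session setup mirrors the pi example above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([3, 1, 2, 3, 1, 2])

# Transformations: each returns a *new* RDD; the source RDD is untouched
squared = rdd.map(lambda x: x ** 2)              # apply a function element-wise
uniques = rdd.distinct()                         # unique values of the source dataset
by_key  = rdd.map(lambda x: (x, 1)).sortByKey()  # sort by a key

# Actions: each returns a final result to the driver
total  = squared.reduce(lambda a, b: a + b)  # aggregate by addition
first3 = rdd.take(3)                         # first n elements
sample = rdd.takeSample(False, 2)            # random sample (without replacement)

print(total, first3, sample)
spark.stop()
```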