
Making Big Data Processing Simple with Spark

Matei Zaharia December 17, 2015

What is Apache Spark?

Fast and general cluster computing engine that generalizes the MapReduce model

Makes it easy and fast to process large datasets
• High-level APIs in Java, Scala, Python, R
• Unified engine that can capture many workloads

A Unified Engine

Spark core engine, with built-in libraries on top:
• Spark Streaming (real-time)
• Spark SQL (structured data)
• MLlib (machine learning)
• GraphX (graph processing)

A Large Community

Most active open source project for big data

[Chart: Contributors / Month to Spark, 2010-2015, showing steady growth (y-axis 0-160)]

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

History: Cluster Computing (2004)

MapReduce: a general engine for batch processing

Beyond MapReduce

MapReduce was great for batch processing, but users quickly needed to do more:
• More complex, multi-pass algorithms
• More interactive ad-hoc queries
• More real-time stream processing

Result: specialized systems for these workloads

Big Data Systems Today:
• General batch processing: MapReduce
• Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .

Problems with Specialized Systems

More systems to manage, tune, deploy

Can’t easily combine processing types
• Even though most applications need to do this!
• E.g. load data with SQL, then run machine learning

In many cases, data transfer between engines is a dominant cost!

Big Data Systems Today:
• General batch processing: MapReduce
• Specialized systems for new workloads: Pregel, Dremel, Presto, Storm, Giraph, Drill, Impala, S4, . . .
• Unified engine: ?

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

Background

Recall 3 workloads were issues for MapReduce:
• More complex, multi-pass algorithms
• More interactive ad-hoc queries
• More real-time stream processing

While these look different, all 3 need one thing that MapReduce lacks: efficient data sharing

Data Sharing in MapReduce

[Diagram: an iterative job reads its input from HDFS, writes each iteration's result back to HDFS, and reads it again for the next iteration; ad-hoc queries each re-read the input from HDFS to produce their results.]

Slow due to replication and disk I/O

What We’d Like

[Diagram: after one-time processing of the input, iterations and queries share data through distributed memory instead of HDFS.]

10-100x faster than network and disk

Spark Programming Model

Resilient Distributed Datasets (RDDs)
• Collections of objects stored in RAM or disk across cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()          # Action
messages.filter(lambda s: "Redis" in s).count()
...

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block (Block 1-3), caches its partition of messages (Cache 1-3), and sends results back to the driver.]

Example: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)

Fault Tolerance

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)

RDDs track lineage info to rebuild lost data

[Diagram: lineage graph input file -> map -> reduce -> filter; any lost partition can be recomputed by replaying this chain.]

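To make the lineage idea concrete, here is a minimal PySpark sketch (assuming an existing SparkContext `sc` and an illustrative log file path): every RDD can print the chain of transformations Spark would replay to rebuild a lost partition.

counts = (sc.textFile("hdfs://.../events.log")
            .map(lambda line: (line.split('\t')[0], 1))
            .reduceByKey(lambda x, y: x + y)
            .filter(lambda kv: kv[1] > 10))

# toDebugString() prints the lineage: textFile -> map -> reduceByKey -> filter.
# If a cached or in-flight partition is lost, Spark recomputes it from this graph.
print(counts.toDebugString())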

Example: Logistic Regression

0

500

1000

1500

2000

2500

3000

3500

4000

1 5 10 20 30

Runn

ing

Tim

e (s

)

Number of Iterations

Hadoop

Spark

110 s / iteration

first iteration 80 s further iterations 1 s
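As a rough sketch of why caching makes the later iterations so cheap, a minimal logistic regression loop in PySpark might look like the following; the file path, feature count, and data format are assumptions for illustration, not the benchmark code.

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="LogisticRegressionSketch")

# Assumed format: label followed by whitespace-separated features on each line.
def parse_point(line):
    values = [float(x) for x in line.split()]
    return values[0], np.array(values[1:])

# Cache the parsed dataset: every iteration below reuses it from memory.
points = sc.textFile("hdfs://.../lr_data.txt").map(parse_point).cache()

D = 10                 # number of features (assumed)
w = np.zeros(D)        # model weights
for i in range(30):
    # One pass over the cached data per iteration, so only the first
    # iteration pays the cost of reading and parsing the input.
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient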

On-Disk Performance

Time to sort 100 TB (Source: Daytona GraySort benchmark, sortbenchmark.org):
• 2013 record: Hadoop, 2100 machines, 72 minutes
• 2014 record: Spark, 207 machines, 23 minutes

Libraries Built on Spark

Spark core engine, with built-in libraries on top:
• Spark Streaming (real-time)
• Spark SQL (structured data)
• MLlib (machine learning)
• GraphX (graph processing)

Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
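The snippet above is deliberately high-level pseudocode. A runnable PySpark sketch of the first two steps (SQL, then MLlib, with no export/import step in between) might look like this, using the Spark 1.x APIs of the time; the file path, table, and column names are assumptions.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="CombinedSketch")
sqlContext = SQLContext(sc)

# Load structured data and query it with SQL
sqlContext.read.json("hdfs://.../tweets.json").registerTempTable("tweets")
points = (sqlContext.sql("select latitude, longitude from tweets")
                    .rdd.map(lambda row: [row.latitude, row.longitude]))

# Train a machine learning model directly on the SQL results
model = KMeans.train(points, 10)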

Combining Processing Types

Separate systems: each step (ETL, train, query) writes its output to HDFS and the next step reads it back, so every boundary pays a full disk round trip.

Spark: ETL, train, and query run in one engine, sharing data in memory.

Performance vs Specialized Systems

[Charts: SQL response time (sec) for Hive, Impala (disk), Impala (mem), Spark (disk), Spark (mem); ML response time (min) for Mahout, GraphLab, Spark; streaming throughput (MB/s/node) for Storm and Spark. Spark performs comparably to or better than the specialized engines in each category.]

Some Recent Additions

• DataFrame API (similar to R and Pandas): an easy programmatic way to work with structured data
• R interface (SparkR)
• Machine learning pipelines (like scikit-learn)
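For a quick feel of the DataFrame API, here is a minimal sketch (assuming an existing SparkContext `sc`; the file and column names are illustrative):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.json("hdfs://.../people.json")

# Filter and project columns programmatically, much like an R or Pandas data frame
df.filter(df.age > 21).select("name", "age").show()

# Simple aggregation: count of people per age
df.groupBy("age").count().show()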

Overview

Why a unified engine?

Spark programming model

Built-in libraries

Applications

Spark Community

Over 1000 deployments, clusters up to 8000 nodes

Many talks online at spark-summit.org

Top Applications

• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection / Security: 29%

Spark Components Used

• Spark SQL: 69%
• DataFrames: 62%
• Spark Streaming: 58%
• MLlib + GraphX: 58%

75% of users use more than one component

Learn More

Get started on your laptop: spark.apache.org

Resources and MOOCs: sparkhub.databricks.com

Spark Summit: spark-summit.org

Thank You
