Hands on with Apache Spark

APACHE SPARK: HANDS ON

Andy Grove, Chief Architect

Dan Lynn, CEO

FOLLOW ALONG!

• Download IntelliJ Community Edition

• http://tiny.cc/get-intellij

• Snag our example code

• http://tiny.cc/agildata-spark

• git clone [email protected]:codefutures/apache-spark-examples.git


Andy Grove, Co-Founder & Chief Architect

• Co-Founder @ Orbware Technologies (acquired 2000) • Inventor of Firestorm/[email protected]

• Providers of dbShards • Relational Database Scaling

• Big Data Consulting • Data Strategy • Data Architecture Reviews • Big Data Training • Solution Implementation

• Distributed over 6 states! • Headquartered in Broomfield, CO

www.agildata.com

Dan Lynn, CEO

• Co-Founder @ FullContact • 15 years building software • Techstars [email protected]

AGENDA

• Part 1 - Overview of Spark

• Motivation, APIs, Ecosystem, Simple Example

• Part 2 - Hands On

• Work through a real data problem

PART 1: AN OVERVIEW OF SPARK

A BRIEF HISTORY LESSON

• First there was Hadoop

• Goal: Process petabytes of constantly-growing data

• “Move the processing to the data”

• But MapReduce was difficult to program

• So they made Pig, Hive, Cascading, etc…

A BRIEF HISTORY LESSON

• MapReduce was also very reliable

• But it performed poorly on iterative tasks like machine learning.

• So in 2009, UC Berkeley started on a new approach

• Keeping data in memory as much as possible.

A BRIEF HISTORY LESSON

• They called it “Spark”

• After lots of community adoption, it entered the Apache Incubator in 2013 and became a top-level Apache project in 2014.

• Since then, it has gained mainstream acceptance.

• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015

A BRIEF HISTORY LESSON

• Huge ecosystem

• Machine learning: MLlib, Mahout

• Graph processing: GraphX

• Read from / write to anything that Hadoop can

• Tons of community contributions: spark-packages.org

• Zeppelin: Python-style interactive notebooks

CONCEPTS

CONCEPTS - RDD

RDD aka “Resilient Distributed Dataset”

your_data          <— an RDD

f(your_data)       <— also an RDD

g(f(your_data))    <— so is this

RDD - SECRET INTERNALS!!!11

/**
 * Tells the Spark framework *where* the data is.
 */
protected Partition[] getPartitions();

/**
 * Iterates through the data for a given partition.
 */
Iterator<T> compute(Partition split, TaskContext context);
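To make the two internal methods concrete, here is a minimal toy model in plain Java. It is only a sketch: `ToyRdd`, `numPartitions`, and `of` are hypothetical names invented for illustration and are not part of Spark. It shows how a dataset that only knows its partitions and how to iterate over one of them can be wrapped by `map` to form a new dataset, much like f(your_data) wraps your_data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Toy model only: this hypothetical class is NOT Spark's real RDD.
abstract class ToyRdd<T> {
    /** How many partitions the data is split into (stands in for getPartitions). */
    abstract int numPartitions();

    /** Iterates through the data of one partition (stands in for compute). */
    abstract Iterator<T> compute(int partition);

    /** Transformation: returns a new ToyRdd wrapping this one; no data is read yet. */
    <R> ToyRdd<R> map(Function<T, R> f) {
        ToyRdd<T> parent = this;
        return new ToyRdd<R>() {
            int numPartitions() { return parent.numPartitions(); }
            Iterator<R> compute(int partition) {
                Iterator<T> it = parent.compute(partition);
                return new Iterator<R>() {
                    public boolean hasNext() { return it.hasNext(); }
                    public R next() { return f.apply(it.next()); }
                };
            }
        };
    }

    /** Action: walks every partition and gathers the results into a local list. */
    List<T> collect() {
        List<T> out = new ArrayList<>();
        for (int p = 0; p < numPartitions(); p++) {
            compute(p).forEachRemaining(out::add);
        }
        return out;
    }

    /** Source dataset backed by in-memory lists, one list per partition. */
    static <T> ToyRdd<T> of(List<List<T>> partitions) {
        return new ToyRdd<T>() {
            int numPartitions() { return partitions.size(); }
            Iterator<T> compute(int partition) { return partitions.get(partition).iterator(); }
        };
    }

    public static void main(String[] args) {
        ToyRdd<Integer> data = ToyRdd.of(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)));
        System.out.println(data.map(x -> x * 10).map(x -> x + 1).collect());
    }
}
```

Each call to `map` only builds a new wrapper object; the source lists are not traversed until `collect()` asks every partition for its iterator.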

RDD - PUBLIC API

Two options:

• Transformations

• Make new RDDs by applying transformation functions.

• Actions

• Write to HDFS, write to databases, yield an answer, etc…

RDD - PUBLIC API

• Transformations

• .map(func) .filter(func) .flatMap(func) .sample(…)

• Actions

• .collect() .saveAsTextFile(path) .reduce(func) .take(n)
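The practical difference between the two groups is laziness: transformations only describe work, and nothing runs until an action asks for a result. The sketch below uses `java.util.stream` as a local analogue, not Spark itself. Intermediate stream operations play the role of transformations and a terminal operation plays the role of an action. `LazyDemo`, `buildPipeline`, and `run` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: Java streams standing in for RDD transformations and actions.
class LazyDemo {
    /** Counts how many times the map function has actually been invoked. */
    static final AtomicInteger calls = new AtomicInteger();

    /** "Transformations": describes the pipeline but processes no elements. */
    static Stream<Integer> buildPipeline(List<Integer> input) {
        return input.stream()
                .filter(x -> x % 2 == 1)
                .map(x -> { calls.incrementAndGet(); return x * x; });
    }

    /** "Action": a terminal operation forces the whole pipeline to run. */
    static List<Integer> run(Stream<Integer> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println("map calls before the action: " + calls.get());
        System.out.println("result: " + run(pipeline));
        System.out.println("map calls after the action: " + calls.get());
    }
}
```

The counter stays at zero after `buildPipeline` returns and only moves once `run` collects the result, which mirrors how an RDD lineage sits idle until an action is called.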

EXECUTION MODEL

SPARK EXECUTION MODEL

[Diagram: Spark execution model. Source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals]

[Same diagram with a callout: “What’s this?”]

SPARK EXECUTION MODEL

• Cluster Managers

• Apache Mesos

• YARN (Hadoop’s resource manager, introduced in Hadoop 2.0)

• Spark’s native (standalone) cluster manager

NEW(ER) SPARK APIs

SPARK SQL / DATAFRAME API

• New in Spark 1.3. The core engine behind Spark SQL

• If RDDs are transformations that apply to JVM objects…

• Schema (i.e. the class) is passed along with each datum

• Serialization pain. GC pain.

• …then DataFrames are transformations that apply to data

• Schema is defined for the entire set

• Data is transmitted independent of schema. JVM data access incurs much less GC overhead

• DataFrames have more optimized execution logic, i.e. a query planner

DATASET API

• New in Spark 1.6

• Addressed specific deficiencies in DataFrames

• DataFrames lack compile-time type-checking.

• Datasets look like RDDs, but perform like DataFrames
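The compile-time point can be illustrated without Spark. In the sketch below, a plain Java class plays the role of a typed Dataset element and a `Map<String, Object>` plays the role of an untyped DataFrame row. All names (`TypedVsUntyped`, `Person`, `toRow`) are hypothetical stand-ins, not Spark classes.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: plain Java stand-ins, not Spark's DataFrame or Dataset classes.
class TypedVsUntyped {
    /** Hypothetical typed record, playing the role of a Dataset element. */
    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    /** DataFrame-style view: an untyped row whose columns are looked up by name at run time. */
    static Map<String, Object> toRow(Person p) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", p.name);
        row.put("age", p.age);
        return row;
    }

    /** Typed access: a misspelled field such as p.agee would be rejected by the compiler. */
    static int ageNextYearTyped(Person p) {
        return p.age + 1;
    }

    /** Untyped access: a misspelled column name only fails when this code runs. */
    static int ageNextYearUntyped(Map<String, Object> row, String column) {
        Integer value = (Integer) row.get(column); // null if the column does not exist
        return value + 1;                          // unboxing null throws NullPointerException
    }

    public static void main(String[] args) {
        Person p = new Person("Example", 30);
        System.out.println(ageNextYearTyped(p));
        System.out.println(ageNextYearUntyped(toRow(p), "age"));
    }
}
```

Passing a misspelled column name such as "agee" to `ageNextYearUntyped` compiles fine and fails only at run time, which is the class of error that typed access moves to compile time.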

SPARK API CHOICES

Languages: Java, Scala

• RDD

• DataFrame: sketchy…

• Spark SQL

• Dataset: exciting, but very new (both Java and Scala)

QUICK EXAMPLE

• Let’s count Shakespeare’s favorite words!
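The shape of a word count is the same whatever the engine: split lines into words, group identical words, count each group. The sketch below is a local analogue in plain Java streams, not the presenters' Spark code (that lives in the example repository). `WordCountSketch` and `countWords` are hypothetical names, and the rule that words are lower-cased and split on non-word characters is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Local analogue of a word count; it does not use Spark.
class WordCountSketch {
    /** Splits each line into lower-cased words and counts the occurrences of each word. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // drop the empty token that split() yields for leading punctuation
                .filter(word -> !word.isEmpty())
                // group identical words and count each group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be,", "that is the question");
        System.out.println(countWords(lines));
    }
}
```

In Spark the same three steps map onto a flatMap, a pairing of each word with a count of one, and a per-key reduction, with the difference that each step runs across partitions instead of over one in-memory list.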

PART 2: HANDS ON

• The problem: Rank Colorado counties by gender ratio.

• The data: US census data from 2010

• The approach:

• RDD API (in both Java 8 and Scala)

• DataFrame API / Spark SQL

• Dataset API
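Whichever API is used, the core computation is small: derive a ratio per county, then sort. The plain Java sketch below shows that logic under stated assumptions. The ratio is taken to be males divided by females (the slides do not define it), and `GenderRatioSketch`, `County`, and `rank` are hypothetical names. The figures in `main` are made-up placeholders, not census data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the ranking logic only; it reads no census files and does not use Spark.
class GenderRatioSketch {
    /** Hypothetical record holding one county's population counts. */
    static final class County {
        final String name;
        final long males;
        final long females;
        County(String name, long males, long females) {
            this.name = name;
            this.males = males;
            this.females = females;
        }
        /** Assumed definition: males per female. */
        double ratio() { return (double) males / females; }
    }

    /** Returns a new list sorted from the highest ratio to the lowest. */
    static List<County> rank(List<County> counties) {
        Comparator<County> byRatio = Comparator.comparingDouble(County::ratio);
        List<County> sorted = new ArrayList<>(counties);
        sorted.sort(byRatio.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        List<County> counties = Arrays.asList(
                new County("County A", 100, 100),
                new County("County B", 120, 100),
                new County("County C", 90, 100));
        for (County c : rank(counties)) {
            System.out.println(c.name + " " + c.ratio());
        }
    }
}
```

With the RDD API this corresponds to a map that produces (county, ratio) pairs followed by a sort. With DataFrames or Spark SQL it becomes a computed column and an ordering clause.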

REFERENCES

• http://spark.apache.org/research.html

• http://tiny.cc/agildata-spark

• http://spark-packages.org


Andy Grove, Co-Founder & Chief Architect

[email protected]

@andygrove73

www.agildata.com

Dan Lynn, CEO

[email protected]

@danklynn

Thanks!
