Hands on with Apache Spark

Page 1

APACHE SPARK: HANDS ON

Andy Grove, Chief Architect Dan Lynn, CEO

Page 2

FOLLOW ALONG!

• Download IntelliJ Community Edition

• http://tiny.cc/get-intellij

• Snag our example code

• http://tiny.cc/agildata-spark

• git clone git@github.com:codefutures/apache-spark-examples.git

Page 3

Andy Grove, Co-Founder & Chief Architect

Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/DAO
[email protected]

• Providers of dbShards

• Relational Database Scaling

• Big Data Consulting

• Data Strategy

• Data Architecture Reviews

• Big Data Training

• Solution Implementation

• Distributed over 6 states!

• Headquartered in Broomfield, CO

www.agildata.com

Dan Lynn, CEO

Co-Founder @ FullContact
15 years building software
Techstars
[email protected]

Page 4

AGENDA

• Part 1 - Overview of Spark

• Motivation, APIs, Ecosystem, Simple Example

• Part 2 - Hands On

• Work through a real data problem

Page 5

PART 1: AN OVERVIEW OF SPARK

Page 6

A BRIEF HISTORY LESSON

• First there was Hadoop

• Goal: Process petabytes of constantly-growing data

• “Move the processing to the data”

• But MapReduce was difficult to program

• So they made Pig, Hive, Cascading, etc…

Page 7

A BRIEF HISTORY LESSON

• MapReduce was also very reliable

• But it performed poorly on iterative tasks like machine learning.

• So in 2009, UC Berkeley started on a new approach

• Keeping data in memory as much as possible.

Page 8

A BRIEF HISTORY LESSON

• They called it “Spark”

• After lots of community acceptance it became an Apache Project in 2013.

• Since then, it has gained mainstream acceptance.

• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015

Page 9

A BRIEF HISTORY LESSON

• Huge ecosystem

• Machine learning: MLlib, Mahout

• Graph processing: GraphX

• Read from / write to anything that Hadoop can

• Tons of community contributions: spark-packages.org

• Zeppelin: IPython-style interactive notebooks

Page 10

CONCEPTS

Page 11

CONCEPTS - RDD

RDD aka “Resilient Distributed Dataset”

your_data          <— an RDD

f(your_data)       <— also an RDD

g(f(your_data))    <— so is this
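
In code, that chain might look like the following minimal Scala sketch, assuming a SparkContext named sc (as in spark-shell) and an illustrative data.txt:

val your_data = sc.textFile("data.txt")    // an RDD[String]
val f = your_data.map(_.toLowerCase)       // also an RDD (nothing has run yet)
val g = f.filter(_.contains("spark"))      // so is this (still lazy)
println(g.count())                         // count() is an action, so this triggers the job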

Page 12

RDD - SECRET INTERNALS!!!11

/**
 * Tells the Spark framework *where* the data is.
 */
protected Partition[] getPartitions();

/**
 * Iterates through the data for a given partition.
 */
Iterator<T> compute(Partition split, TaskContext context);
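
To make those two methods concrete, here is a minimal sketch (Scala, Spark 1.x) of a toy RDD that serves a range of integers. The class and partition names are illustrative, not part of Spark:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class SimplePartition(index: Int) extends Partition

// A toy RDD of the integers 0 until n, split across `slices` partitions.
class RangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {

  // Tells Spark *where* (and how) the data is split up.
  override protected def getPartitions: Array[Partition] =
    (0 until slices).map(i => SimplePartition(i): Partition).toArray

  // Iterates through the data for one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    (split.index until n by slices).iterator
}

Calling new RangeRDD(sc, 100, 4).collect() would pull every number through compute, one partition per task.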

Page 13

RDD - PUBLIC API

Two Options

• Transformations

• Make new RDDs by applying transformation functions.

• Actions

• Write to HDFS, write to databases, yield an answer, etc…

Page 14

RDD - PUBLIC API

• Transformations

• .map(func) .filter(func) .flatMap(func) .sample(…)

• Actions

• .collect() .reduce(func) .saveAsTextFile(path) .take(n)
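
A short sketch of how these compose, assuming a SparkContext sc; the data and output path are illustrative:

val nums = sc.parallelize(1 to 100)           // an RDD[Int]
val evens = nums.filter(_ % 2 == 0)           // transformation: returns a new RDD
val scaled = evens.map(_ * 10)                // transformation: still lazy
val total = scaled.reduce(_ + _)              // action: runs the job, returns an Int
val firstFive = scaled.take(5)                // action: returns an Array[Int]
scaled.saveAsTextFile("/tmp/spark-demo-out")  // action: writes one part-file per partition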

Page 15

EXECUTION MODEL

Page 16

SPARK EXECUTION MODEL

[Diagram: the Spark execution model, from https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals]

What’s this? (the cluster manager, answered on the next slide)

Page 17

SPARK EXECUTION MODEL

• Cluster Managers

• Apache Mesos

• YARN (aka Hadoop 2.0)

• Spark Standalone (Spark’s native cluster manager)
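
The cluster manager is a configuration choice, not a code change. A minimal sketch (Scala, Spark 1.x), with placeholder host names:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hands-on-spark")
  .setMaster("local[4]")  // or "mesos://host:5050", "yarn-client",
                          // or "spark://host:7077" for Spark Standalone
val sc = new SparkContext(conf)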

Page 18

NEW(ER) SPARK APIs

Page 19

SPARK SQL / DATAFRAME API

• New in Spark 1.3. The core engine behind Spark SQL

• If RDDs are transformations that apply to JVM objects…

• Schema (i.e. the class) is passed along with each datum

• Serialization pain. GC pain.

• …then DataFrames are transformations that apply to data

• Schema is defined for the entire set

• Data is transmitted independent of schema. JVM data access incurs much less GC overhead

• DataFrames have more optimized execution logic, i.e. a query planner (the Catalyst optimizer)
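
A minimal sketch of the DataFrame API (Scala, Spark 1.4+), assuming a SQLContext named sqlContext and a hypothetical people.json with name and age fields:

val df = sqlContext.read.json("people.json")  // schema is inferred once, for the whole set

df.filter(df("age") > 21)
  .groupBy("age")
  .count()
  .show()                                     // the planner optimizes the whole chain

// The same query through Spark SQL:
df.registerTempTable("people")
sqlContext.sql("SELECT age, COUNT(*) AS n FROM people WHERE age > 21 GROUP BY age").show()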

Page 20

DATASET API

• New in Spark 1.6

• Addressed specific deficiencies in DataFrames

• DataFrames lack compile-time type-checking.

• Datasets look like RDDs, but perform like DataFrames
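
A minimal sketch of the Spark 1.6 Dataset API, again assuming a SQLContext named sqlContext; Person is an illustrative case class:

import sqlContext.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("Ada", 36), Person("Bob", 19)).toDS()  // a Dataset[Person]

// Typed like an RDD: a misspelled field is a compile error, not a runtime one...
val adults = ds.filter(_.age >= 21)

// ...but it executes like a DataFrame, through the same optimizer and encoders.
adults.show()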

Page 21

SPARK API CHOICES

            Java                      Scala
RDD         ✓                         ✓
DataFrame   sketchy…                  sketchy…
Spark SQL   ✓                         ✓
Dataset     exciting, but very new    exciting, but very new

Page 22

QUICK EXAMPLE

• Let’s count Shakespeare’s favorite words!
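
A minimal sketch of that word count (Scala), assuming a SparkContext sc and a local shakespeare.txt; the file name and tokenization are illustrative:

val counts = sc.textFile("shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+"))  // split lines into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // transformation: sum the 1s per word

counts
  .sortBy(_._2, ascending = false)       // highest counts first
  .take(10)                              // action: top ten back to the driver
  .foreach(println)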

Page 23

PART 2: HANDS ON

Page 24

PART 2: HANDS ON

• The problem: Rank Colorado counties by gender ratio.

• The data: US census data from 2010

• The approach:

• RDD API (in both Java 8 and Scala)

• DataFrame API / Spark SQL

• Dataset API
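
As a preview, here is one hedged sketch of the RDD approach (Scala). The file name, CSV layout, and column positions are assumptions for illustration, not the repo's actual code:

case class CensusRow(county: String, sex: String, population: Long)

val rows = sc.textFile("co-census-2010.csv")
  .map(_.split(","))
  .map(a => CensusRow(a(0), a(1), a(2).toLong))

val ratios = rows
  .map(r => (r.county, if (r.sex == "male") (r.population, 0L) else (0L, r.population)))
  .reduceByKey { case ((m1, f1), (m2, f2)) => (m1 + m2, f1 + f2) }
  .map { case (county, (male, female)) => (county, male.toDouble / female) }  // males per female

ratios.sortBy(_._2, ascending = false).collect().foreach(println)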

Page 25

REFERENCES

• http://spark.apache.org/research.html

• http://tiny.cc/agildata-spark

• http://spark-packages.org

Page 26

Andy Grove, Co-Founder & Chief Architect

[email protected]

@andygrove73

www.agildata.com

Dan Lynn, CEO

[email protected]

@danklynn

Thanks!