APACHE SPARK: HANDS ON
Andy Grove, Chief Architect
Dan Lynn, CEO
FOLLOW ALONG!
• Download IntelliJ Community Edition
• http://tiny.cc/get-intellij
• Snag our example code
• http://tiny.cc/agildata-spark
• git clone [email protected]:codefutures/apache-spark-examples.git
Andy Grove, Co-Founder & Chief Architect
Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/[email protected]
• Providers of dbShards • Relational Database Scaling
• Big Data Consulting • Data Strategy • Data Architecture Reviews • Big Data Training • Solution Implementation
• Distributed over 6 states! • Headquartered in Broomfield, CO
www.agildata.com
Dan Lynn, CEO
Co-Founder @ FullContact
15 years building software
Techstars [email protected]
AGENDA
• Part 1 - Overview of Spark
• Motivation, APIs, Ecosystem, Simple Example
• Part 2 - Hands On
• Work through a real data problem
PART 1: AN OVERVIEW OF SPARK
A BRIEF HISTORY LESSON
• First there was Hadoop
• Goal: Process petabytes of constantly-growing data
• “Move the processing to the data”
• But MapReduce was difficult to program
• So they made Pig, Hive, Cascading, etc…
A BRIEF HISTORY LESSON
• MapReduce was also very reliable
• But it performed poorly on iterative tasks like machine learning.
• So in 2009, UC Berkeley started on a new approach
• Keeping data in memory as much as possible.
A BRIEF HISTORY LESSON
• They called it “Spark”
• After lots of community acceptance it became an Apache Project in 2013.
• Since then, it has gained mainstream acceptance.
• “Potentially the Most Significant Open Source Project of the Next Decade” - IBM, June 15, 2015
A BRIEF HISTORY LESSON
• Huge ecosystem
• Machine learning: MLlib, Mahout
• Graph processing: GraphX
• Read from / write to anything that Hadoop can
• Tons of community contributions: spark-packages.org
• Zeppelin: IPython-style interactive notebooks
CONCEPTS
CONCEPTS - RDD
RDD aka “Resilient Distributed Dataset”
your_data <— an RDD
f(your_data) <— also an RDD
g(f(your_data)) <— so is this
RDD - SECRET INTERNALS!!!11
/**
 * Tells the Spark framework *where* the data is.
 */
protected Partition[] getPartitions();

/**
 * Iterates through the data for a given partition.
 */
Iterator<T> compute(Partition split, TaskContext context);
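To make the two methods above concrete, here is a toy model in plain Java. It is not Spark code: `ToyRdd`, `numPartitions`, `of` and `ToyRddDemo` are hypothetical names invented for this sketch, and partitions are simplified to integer indexes over in-memory lists. It shows how a transformation can wrap its parent without running anything, and how an action walks the partitions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Toy model only: ToyRdd is a hypothetical class, not Spark's RDD.
abstract class ToyRdd<T> {
    // Tells the "framework" how many partitions exist (simplified from Partition[]).
    abstract int numPartitions();

    // Iterates through the data for a given partition.
    abstract Iterator<T> compute(int partition);

    // A transformation: returns a new ToyRdd wrapping this one; nothing runs yet.
    <R> ToyRdd<R> map(Function<T, R> f) {
        ToyRdd<T> parent = this;
        return new ToyRdd<R>() {
            int numPartitions() { return parent.numPartitions(); }
            Iterator<R> compute(int partition) {
                Iterator<T> it = parent.compute(partition);
                return new Iterator<R>() {
                    public boolean hasNext() { return it.hasNext(); }
                    public R next() { return f.apply(it.next()); }
                };
            }
        };
    }

    // An action: walks every partition and gathers the results.
    List<T> collect() {
        List<T> out = new ArrayList<>();
        for (int p = 0; p < numPartitions(); p++) {
            compute(p).forEachRemaining(out::add);
        }
        return out;
    }

    // Builds a source ToyRdd from in-memory partitions.
    static <T> ToyRdd<T> of(List<List<T>> partitions) {
        return new ToyRdd<T>() {
            int numPartitions() { return partitions.size(); }
            Iterator<T> compute(int partition) { return partitions.get(partition).iterator(); }
        };
    }
}

class ToyRddDemo {
    public static void main(String[] args) {
        ToyRdd<Integer> yourData = ToyRdd.of(List.of(List.of(1, 2), List.of(3, 4)));
        ToyRdd<Integer> f = yourData.map(x -> x * 10);  // f(your_data)
        ToyRdd<String> g = f.map(x -> "v" + x);         // g(f(your_data))
        System.out.println(g.collect());                // prints [v10, v20, v30, v40]
    }
}
```

Each `map` call only builds a new object that remembers its parent; the data is touched when `collect` asks each partition for its iterator.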
RDD - PUBLIC API
Two options:
• Transformations
• Make new RDDs by applying transformation functions.
• Actions
• Write to HDFS, write to databases, yield an answer, etc…
RDD - PUBLIC API
• Transformations
• .map(func) .filter(func) .flatMap(func) .sample(…)
• Actions
• .reduce(func) .collect() .saveAsTextFile(path) .take(n)
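In Spark, transformations are lazy: they only describe a new RDD, and work happens when an action is called. The sketch below shows the same behaviour using `java.util.stream` as an analogy, since intermediate stream operations are also lazy and terminal operations trigger the work. It is not Spark code; `LazyDemo`, `calls` and `transformed` are names made up for the illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyDemo {
    // Counts how many times the map function has actually been invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // "Transformations": build a pipeline, but do not run it.
    static Stream<Integer> transformed(List<Integer> data) {
        return data.stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; })
                .filter(x -> x > 4);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = transformed(List.of(1, 2, 3, 4));
        System.out.println("calls before action: " + calls.get()); // prints 0

        // "Action": the terminal operation pulls data through the pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println(result + " after " + calls.get() + " calls"); // prints [6, 8] after 4 calls
    }
}
```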
EXECUTION MODEL
SPARK EXECUTION MODEL
https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
What’s this?
SPARK EXECUTION MODEL
• Cluster Managers
• Apache Mesos
• YARN (the resource manager introduced in Hadoop 2.0)
• Spark’s native cluster manager
NEW(ER) SPARK APIs
SPARK SQL / DATAFRAME API
• New in Spark 1.3. The core engine behind Spark SQL
• If RDDs are transformations that apply to JVM objects…
• Schema (i.e. the class) is passed along with each datum
• Serialization pain. GC pain.
• …then DataFrames are transformations that apply to data
• Schema is defined for the entire set
• Data is transmitted independent of schema. JVM data access incurs much less GC overhead
• DataFrames have more optimized execution logic, i.e. a query planner
DATASET API
• New in Spark 1.6
• Addressed specific deficiencies in DataFrames
• DataFrames lack compile-time type-checking.
• Datasets look like RDDs, but perform like DataFrames
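The compile-time type-checking point can be illustrated without Spark. The sketch below is a plain-Java analogy, not the DataFrame or Dataset API: a row addressed by string column names only fails when it runs, while a typed object is checked by the compiler. `County`, `TypeCheckSketch`, `untypedLookup`, `typedLookup` and the sample values are all hypothetical.

```java
import java.util.Map;

// Hypothetical typed record standing in for a Dataset element.
class County {
    final String name;
    final long population;

    County(String name, long population) {
        this.name = name;
        this.population = population;
    }
}

class TypeCheckSketch {
    // DataFrame-like access: columns are named by strings, so a typo
    // in the column name is only discovered at runtime.
    static Object untypedLookup(Map<String, Object> row, String column) {
        if (!row.containsKey(column)) {
            throw new IllegalArgumentException("no such column: " + column);
        }
        return row.get(column);
    }

    // Dataset-like access: a misspelled field name would not compile.
    static long typedLookup(County county) {
        return county.population;
    }

    public static void main(String[] args) {
        Map<String, Object> row = Map.of("name", "Example County", "population", 1000L);
        County county = new County("Example County", 1000L);

        System.out.println(typedLookup(county)); // prints 1000
        try {
            untypedLookup(row, "populaton"); // typo compiles fine, fails when run
        } catch (IllegalArgumentException e) {
            System.out.println("runtime failure: " + e.getMessage());
        }
    }
}
```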
SPARK API CHOICES
(columns: Java | Scala)
RDD
DataFrame: sketchy…
Spark SQL
Dataset: exciting, but very new | exciting, but very new
QUICK EXAMPLE
• Let’s count Shakespeare’s favorite words!
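The talk's own example lives in the example repository. As a rough sketch of the shape of a word count, here is a plain-Java version using `java.util.stream` on two inline lines of text. It is not Spark code, and `WordCountSketch` and `countWords` are made-up names. In Spark's Java API the same steps would typically be expressed with `flatMap`, `mapToPair` and `reduceByKey`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {
    // Splits each line into lowercase words, then counts occurrences per word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // one line -> many words (the "flatMap" step)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("[^a-z']+")))
                .filter(word -> !word.isEmpty())
                // group equal words and count them (the "reduce by key" step)
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("To be, or not to be,", "that is the question.");
        // prints {be=2, is=1, not=1, or=1, question=1, that=1, the=1, to=2}
        System.out.println(countWords(lines));
    }
}
```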
PART 2: HANDS ON
• The problem: Rank Colorado counties by gender ratio.
• The data: US census data from 2010
• The approach:
• RDD API (in both Java 8 and Scala)
• DataFrame API / Spark SQL
• Dataset API
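Whatever API is used, the core of the problem is a per-county ratio followed by a sort. The plain-Java sketch below shows only that step and rests on several assumptions: "gender ratio" is taken to mean males per female, the class and method names are hypothetical, and the rows are placeholders rather than census figures. The census file format and the Spark versions are in the example repository.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row type: one county with male and female population counts.
class CountyRatio {
    final String county;
    final long male;
    final long female;

    CountyRatio(String county, long male, long female) {
        this.county = county;
        this.male = male;
        this.female = female;
    }

    // Assumption: "gender ratio" means males per female.
    double maleToFemale() {
        return (double) male / female;
    }
}

class RankSketch {
    // Returns county names ordered from highest to lowest male-to-female ratio.
    static List<String> rankByRatio(List<CountyRatio> rows) {
        List<CountyRatio> sorted = new ArrayList<>(rows);
        Comparator<CountyRatio> byRatio = Comparator.comparingDouble(CountyRatio::maleToFemale);
        sorted.sort(byRatio.reversed());

        List<String> names = new ArrayList<>();
        for (CountyRatio row : sorted) {
            names.add(row.county);
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder rows, not real census data.
        List<CountyRatio> rows = List.of(
                new CountyRatio("County A", 100, 100),
                new CountyRatio("County B", 120, 100),
                new CountyRatio("County C", 90, 100));
        System.out.println(rankByRatio(rows)); // prints [County B, County A, County C]
    }
}
```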
REFERENCES
• http://spark.apache.org/research.html
• http://tiny.cc/agildata-spark
• http://spark-packages.org
Andy Grove, Co-Founder & Chief Architect
@andygrove73
www.agildata.com
Dan Lynn, CEO
@danklynn
Thanks!