Author
sdeeg
DESCRIPTION
An overview of the Apache Spark project from the perspective of a Java programmer. Topics: What is Spark, Spark Programming Model, Spark eco-system, 1.0 release and why it's a huge milestone.
© Copyright 2013 Pivotal. All rights reserved.
Intro to Apache Spark
A primer for POJGs (Plain Old Java Geeks)
Scott Deeg: Sr. Field Engineer [email protected]
Agenda
• Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
• What is Spark, and what does it have to do with BigData/Hadoop?
  – Ecosystem (Shark, Streaming, MLlib, GraphX)
• Spark Programming Model
  – Demo: interactive shell
• Related Projects
• Spark 1.0
• More Tech: WordCount, TicTacToe
  – dev experience, Java8
• Deployment Topologies
  – Simple Cluster Demo
Who Am I?
Just a Plain Old Java Guy
• Java since 1996, Symantec Visual Café 1.0
• Random consulting around Silicon Valley
• Hacker on a Java-based BPM product for 10 years
• Joined VMware in 2009 when it acquired SpringSource
• Rolled into Pivotal April 1 2013
What is Pivotal?
• Cloud, Big Data, Fast Data, Modern Apps
• Technology Bets
  – HDFS will be the way we talk to Enterprise data repositories
    ▪ Consolidate Silos in “Data Lake”
    ▪ Eco-system of services will arise to utilize HDFS data
  – PaaS will manage the Application Life Cycle
  – OSS will be the basis for solutions
  – Cloud Architecture
    ▪ Distributed / Parallel
    ▪ CPU, Memory, Network … storage is a distributed service
[Diagram: Pivotal Platform. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; SQL, Objects, JSON, GemFireXD, etc.; End Users, Developers, AppOps]
What Is Spark?
Hint: It’s all about the RDD
?
• Is it “Big Data”?
• Is it “Hadoop”?
• It’s one of those “in memory” things, right?
• JVM, Java, Scala
• Is it real, or just another shiny technology with a long but ultimately small tail?
Spark is …
• Distributed/Cluster Compute Execution Engine
  – Came out of the AMPLab project at UCB, now an ASF top-level project
• Designed to work with data in memory
• Similar scalability and fault tolerance to Hadoop Map/Reduce
  – Uses lineage to reconstitute data instead of replication
• Generalization of Map/Reduce
  – Implementation of Resilient Distributed Dataset (RDD)
• Programmatic or Interactive
• Written in Scala
Spark is also …
• An ASF top-level project
• Has ~100 contributors across 25 companies
  – More active than Hadoop MapReduce
• An eco-system of domain specific tools
  – Different models, but mostly interoperable
• Hadoop Compatible
Berkeley Data Analytics Stack (BDAS)
Support
• Batch
• Streaming
• Interactive
Make it easy to compose them
Short History
• 2009: Started as research project at UCB
• 2010: Open Sourced
• January 2011: AMPLab Created
• October 2012: 0.6
  – Java, Stand alone cluster, maven
• June 21 2013: Spark accepted into ASF Incubator
• Feb 27 2014: Spark becomes top level ASF project
• May 30 2014: Spark 1.0
Spark Philosophy
• Make life easy and productive for Data Scientists
• Provide well documented and expressive APIs
• Powerful Domain Specific Libraries
• Easy integration with storage systems
• Caching to avoid data movement (performance)
• Well defined releases, stable API
Spark is not Hadoop, but is compatible
• Often better than Hadoop (Eric Baldeschwieler)
  – M/R is fine for “Data Parallel”, but awkward for some workloads
  – Low latency dispatch, Iterative, Streaming
• Natively accesses Hadoop data
• Spark is just another YARN job
  – Maintains huge investment in data collection
  – Brings Spark to the Data
• It’s not OR … it’s AND!
Improvements over Map/Reduce
• Efficiency
  – General Execution Graphs (not just map->reduce->store)
  – In memory
• Usability
  – Rich APIs in Scala, Java, Python
  – Interactive
• Can Spark be the R for Big Data?
Spark Programming Model
RDDs in Detail
Core Concept
Think of a program as a set of transformations on a Distributed Dataset
Model: Resilient Distributed Dataset (RDD)
  – Read-only collection of objects spread across a cluster
  – RDDs are built through parallel transformations (map, filter, etc.)
  – Automatically rebuilt on failure using lineage
  – Controllable persistence (RAM, HDFS, etc.)
Operations
• Create
  – From stable storage (HDFS)
• Transform
  – Generate RDD from other RDD (map, filter, groupBy)
  – Lazy Operations that build a DAG
  – Once Spark knows your transformations it can build an efficient plan
• Action
  – Return a result or write to storage (count, collect, reduce, save)
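The create / transform / action split can be tried out without a cluster. The sketch below is plain Java 8 using java.util.stream, not the Spark API; the class LazyPipelineSketch, the method errorFields and the sample log lines are all made up for illustration. Stream intermediate operations stand in for transformations (they only describe work), and the terminal collect stands in for an action.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {

    /** Counts how often the filter function has actually been invoked. */
    static final AtomicInteger filterCalls = new AtomicInteger();

    /** "Transform": describes the work but does not run it yet. */
    static Stream<String> errorFields(List<String> lines) {
        return lines.stream()
                .filter(line -> {
                    filterCalls.incrementAndGet();
                    return line.contains("error");
                })
                .map(line -> line.split("\t")[2]); // keep the third tab-separated field
    }

    public static void main(String[] args) {
        // "Create": a small in-memory list stands in for a file in stable storage.
        List<String> lines = Arrays.asList(
                "10:01\terror\tdisk full",
                "10:02\tinfo\tstarted",
                "10:03\terror\ttimeout");

        Stream<String> plan = errorFields(lines);
        System.out.println("filter calls before the action: " + filterCalls.get());

        // "Action": only now is the pipeline executed.
        List<String> result = plan.collect(Collectors.toList());
        System.out.println("filter calls after the action: " + filterCalls.get());
        System.out.println(result);
    }
}
```

The analogy is loose: a Java stream runs once in a single JVM, while an RDD can be reused and is spread across a cluster. It only shows why deferring execution lets the engine see the whole chain before doing any work.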
Demo: Log Mining
• Scala shell
• Load file from HDFS
• Search for patterns
Transformations and Actions
• Transformations
  – map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, sort
• Actions
  – count, collect, reduce, lookup, save
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions

    val cachedMsgs = textFile(...).filter(_.contains("error"))
                                  .map(_.split('\t')(2))
                                  .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
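To make the lineage idea concrete, here is a toy model in plain Java. It is not how Spark is implemented and uses none of its classes: ToyDataset, Cached, evict and the rebuilds counter are hypothetical names, and there is a single "partition" in one JVM. Each dataset only remembers its parent and the function applied to it, so a lost cached copy can be rebuilt by replaying the chain from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class LineageSketch {

    /** A toy, single-partition dataset that knows how to (re)compute its contents. */
    interface ToyDataset<T> {
        List<T> compute();
    }

    /** Root of the lineage: data in "stable storage" (here just a list). */
    static <T> ToyDataset<T> source(List<T> stable) {
        return () -> new ArrayList<>(stable);
    }

    /** Remembers parent + predicate; nothing is stored. */
    static <T> ToyDataset<T> filter(ToyDataset<T> parent, Predicate<T> keep) {
        return () -> {
            List<T> out = new ArrayList<>();
            for (T item : parent.compute()) {
                if (keep.test(item)) out.add(item);
            }
            return out;
        };
    }

    /** Remembers parent + function; nothing is stored. */
    static <T, R> ToyDataset<R> map(ToyDataset<T> parent, Function<T, R> f) {
        return () -> {
            List<R> out = new ArrayList<>();
            for (T item : parent.compute()) out.add(f.apply(item));
            return out;
        };
    }

    /** Keeps a copy in memory, but can always fall back to its lineage. */
    static final class Cached<T> implements ToyDataset<T> {
        private final ToyDataset<T> parent;
        private List<T> cache;
        int rebuilds = 0; // how many times the lineage had to be replayed

        Cached(ToyDataset<T> parent) { this.parent = parent; }

        @Override
        public List<T> compute() {
            if (cache == null) {
                cache = parent.compute();
                rebuilds++;
            }
            return cache;
        }

        /** Simulates losing the in-memory copy (for example, a failed node). */
        void evict() { cache = null; }
    }

    /** Same shape as the slide's chain: source -> filter -> map -> cache. */
    static Cached<String> cachedMsgs(List<String> lines) {
        return new Cached<>(map(filter(source(lines), l -> l.contains("error")),
                                l -> l.split("\t")[2]));
    }

    public static void main(String[] args) {
        Cached<String> msgs = cachedMsgs(Arrays.asList(
                "t1\terror\tdisk full", "t2\tinfo\tok", "t3\terror\ttimeout"));
        System.out.println(msgs.compute());
        msgs.evict();                        // lose the cached data
        System.out.println(msgs.compute());  // rebuilt from lineage, same contents
        System.out.println("rebuilds: " + msgs.rebuilds);
    }
}
```

The design point is that no copy of the data is replicated for safety; only the recipe is kept, which is cheap as long as the source is still readable.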
RDDs are Foundational
• General purpose enough to be used to implement other programming models
  – SQL
  – Graph
  – ML
  – MR
Related Projects
Things that run on Spark
Related Projects
• Shark
• Spark SQL
• Spark Streaming
• GraphX
• MLbase
• Others
Shark
• Hive on Spark
  – HiveQL, UDFs, etc.
• Turn SQL into RDD
  – Part of the lineage
• Based on Hive, but takes advantage of Spark for
  – Fast Scheduling
  – Queries are DAGs of jobs, not chained M/R
  – Fast broadcast variables
© Apache Software Foundation
Shark (cont)
• Optimized Columnar Storage format
• Fast/Efficient Compression
  – From Yahoo!
  – Able to hold 3-20x more data in same cluster
• Various other optimizations using partitioning
• Will ultimately run on Spark SQL
  – No Hive dependencies except for accessing the Hive datastore
  – Long running process with management tools
Spark SQL
• Lib in Spark Core to treat RDDs as relations
  – SchemaRDD
• Lighter weight version of Shark
  – No code from Hive
• Import/Export in different Storage formats
  – Parquet, learn schema from existing Hive warehouse
• Takes columnar storage from Shark
Spark SQL Code
• Go take a look
Spark Streaming
• Extend Spark to do large scale stream processing
  – 100s of nodes and second-scale end-to-end latency
• Stateful Processing
  – Hard to make fault tolerant (FT)
  – Storm: requires idempotent updates
• Simple, batch-like API with RDDs
• Single semantics for both real time and high latency
Streaming (cont)
• Input is broken up into Batches that become RDDs
• RDDs are composed into DAGs to generate output
• Raw data is replicated in-memory for FT
Streaming (cont)
• Other features
  – Window-based Transformations
  – Arbitrary join of streams
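The batching model and the window-based transformations mentioned in these streaming slides can be illustrated with a small plain-Java sketch. It does not use Spark Streaming; MicroBatchSketch and its methods are hypothetical names, batches are cut by element count instead of by time, and everything runs in one JVM. The point is that the same batch-style function is applied to every small batch, and a window is just an aggregate over the most recent batch results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MicroBatchSketch {

    /** Cuts an ordered event list into fixed-size batches (a stand-in for time slices). */
    static <T> List<List<T>> toBatches(List<T> events, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < events.size(); i += batchSize) {
            int end = Math.min(i + batchSize, events.size());
            batches.add(new ArrayList<>(events.subList(i, end)));
        }
        return batches;
    }

    /** The batch-style job that is run on every batch: count the error events. */
    static long countErrors(List<String> batch) {
        return batch.stream().filter(e -> e.contains("error")).count();
    }

    /** Window-based transformation: sum of the per-batch results over the last `window` batches. */
    static List<Long> windowedSums(List<Long> perBatch, int window) {
        List<Long> sums = new ArrayList<>();
        for (int i = 0; i < perBatch.size(); i++) {
            long sum = 0;
            for (int j = Math.max(0, i - window + 1); j <= i; j++) {
                sum += perBatch.get(j);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList(
                "error a", "ok", "error b", "error c", "ok", "ok");

        List<Long> perBatch = toBatches(events, 2).stream()
                .map(MicroBatchSketch::countErrors)
                .collect(Collectors.toList());

        System.out.println("errors per batch:         " + perBatch);
        System.out.println("errors over last 2 batches: " + windowedSums(perBatch, 2));
    }
}
```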
GraphX (Alpha)
• Graph processing
  – Replaces Spark Bagel
• Graph Parallel not Data Parallel
  – Reason in the context of neighbors
  – GraphLab API
GraphX (cont)
• Predicting things about people (e.g. political bias)
  – Look at posts, apply classifier, try to predict attribute
  – Local signal is difficult alone
  – Look at context of social network to improve prediction
• Triangle processing
  – More triangles reveal a stronger community
• Collaborative Filtering
  – Bipartite graph processing
  – What I like, who rated those things, what they like => what I may like
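As a small illustration of the triangle-processing bullet, the sketch below counts triangles in an undirected graph using plain Java collections. It is not GraphX code; TriangleCountSketch and countTriangles are hypothetical names, and the method assumes each undirected edge is listed exactly once with no self-loops. For every edge it counts the neighbors the two endpoints share; each triangle is found once per edge, so the total is divided by three.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TriangleCountSketch {

    /**
     * Counts triangles in an undirected graph.
     * Assumes every edge {u, v} appears once and u != v.
     */
    static int countTriangles(int[][] edges) {
        // Build adjacency sets: vertex -> set of neighbors.
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }

        // For each edge, count the neighbors common to both endpoints.
        int closedWedges = 0;
        for (int[] e : edges) {
            Set<Integer> neighborsOfU = adj.get(e[0]);
            Set<Integer> neighborsOfV = adj.get(e[1]);
            for (int w : neighborsOfU) {
                if (neighborsOfV.contains(w)) closedWedges++;
            }
        }
        // Every triangle was counted once for each of its three edges.
        return closedWedges / 3;
    }

    public static void main(String[] args) {
        // Vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
        int[][] edges = { {1, 2}, {2, 3}, {1, 3}, {3, 4} };
        System.out.println("triangles: " + countTriangles(edges));
    }
}
```

This is the "reason in the context of neighbors" style in miniature: the answer for an edge depends on what its endpoints' neighborhoods have in common.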
GraphX (cont)
• Graph Creation => Algorithm => Post Processing
  – Existing systems mainly deal with the Algorithm and are not interactive
  – Unify collection and graph models
• Graphs have
  – Vertices, edges
  – Transformation: reverse, filter, map
  – Joins: graphs and tables
  – Aggregate Neighbors
MLbase
• Machine Learning toolset
  – Library and higher level abstractions
• General tool is MATLAB
  – Difficult for end users to learn, debug, scale solutions
• Starting with MLlib
  – Low level Distributed Machine Learning Library
• Many different Algorithms
  – Classification, Regression, Collaborative Filtering, etc.
Others
• Mesos
  – Enable multiple frameworks to share same cluster resources
  – Twitter is largest user: Over 6,000 servers
• Tachyon
  – In-memory, fault tolerant file system that exposes HDFS
• Catalyst
  – SQL Query Optimizer
Spark 1.0
Release cycle
• 1.0 came out at the end of May
• 1.X expected to be current for several years
• Quarterly release cycle
  – 2 months dev / 1 month QA
  – Actual release is based on vote
• 1.1 due end of August
1.0
• API Stability in 1.X for all non-Alpha projects
  – Can recompile jobs, but hoping for binary compatibility
  – Internal APIs are marked @DeveloperApi or @Experimental
• Focus: Core Engine, Streaming, MLlib, Spark SQL
• History Server for Spark UI
  – Driving development of instrumentation
• Job Submission Tool
  – Don’t configure Context in code (e.g. master)
1.0
• Java8 Lambdas
  – No more writing closures as Classes
  – Functions are interfaces
  – Return type sensitive functions
    ▪ mapToPair
• Python improvements
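The "no more writing closures as classes" point is easiest to see side by side. The sketch below uses only the JDK: java.util.function.Function and Map.Entry stand in for Spark's own function interfaces and pair type, which are not shown here, and LambdaSketch with its two constants is a hypothetical example. Both constants turn a word into a (word, 1) pair, the kind of function one would hand to a pair-producing operation such as the mapToPair mentioned above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.function.Function;

public class LambdaSketch {

    /** Before Java 8: the closure has to be spelled out as an anonymous class. */
    static final Function<String, Map.Entry<String, Integer>> AS_CLASS =
            new Function<String, Map.Entry<String, Integer>>() {
                @Override
                public Map.Entry<String, Integer> apply(String word) {
                    return new SimpleEntry<>(word, 1);
                }
            };

    /** Java 8: the same function as a lambda, because Function is a single-method interface. */
    static final Function<String, Map.Entry<String, Integer>> AS_LAMBDA =
            word -> new SimpleEntry<>(word, 1);

    public static void main(String[] args) {
        System.out.println(AS_CLASS.apply("spark"));
        System.out.println(AS_LAMBDA.apply("spark"));
        System.out.println("same result: "
                + AS_CLASS.apply("spark").equals(AS_LAMBDA.apply("spark")));
    }
}
```

Because a lambda carries no declared return type of its own, an API that wants to distinguish "returns a pair" from "returns a plain value" has to do so through separately named methods, which is one reading of the "return type sensitive functions" bullet.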
1.0
• Hadoop security
  – Kerberos, ACL for UI
• Job cancel from UI
• Distributed GC as things go out of scope
  – Good for long-lived services
• Spark SQL
More Code and Demos
WordCount, TicTacToe, Java8
Code Review: WordCount
• Java API
• Java Code
• More usage of RDDs
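The code reviewed in the talk is not included in these slides. As a stand-in, here is the shape of WordCount in plain Java 8 streams, with no Spark dependency; WordCountSketch and wordCount are hypothetical names. The three steps mirror the usual structure of the job: split lines into words (a flatMap), turn each word into a count of 1, and merge the counts per word (the role a reduce-by-key step plays).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    /** Counts whitespace-separated words across all lines; keys are sorted for stable output. */
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // one line -> many words
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // blank lines produce a single empty token; drop it
                .filter(word -> !word.isEmpty())
                // word -> (word, 1), merging counts for equal words with Integer::sum
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(lines));
    }
}
```

In a distributed setting the merge step is where data moves between machines, which is why it is expressed as an associative function (here Integer::sum) that can be applied in any order.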
TicTacToe: a developer’s experience
• IDE
• Spring
• Building/Logging
• Debugging
Demo: Java 8
Lambda Lambda Lambda
Deployment Topologies
Topologies
• Local
• Spark Cluster (master/slaves)
• Cluster Resource Managers
  – YARN
  – Mesos
• (PaaS?)
Demo:
• Start master and slaves
• Show the UI
• Run a Job
• Talk about the History Server
This And That
How Real is Spark?
• There is some criticism
  – As expected
  – New project!
• There are many indicators that Spark is heading to success
  – Solid technology
  – Good buzz
  – Significant community
Next Steps
• Spark website: http://spark.apache.org
  – Lots’O’Goodstuff
• Spark Summit June 30/July 01
  – http://spark-summit.org
A NEW PLATFORM FOR A NEW ERA