Intro to Apache Spark A primer for POJGs (Plain Old Java Geeks) Scott Deeg: Sr. Field Engineer [email protected]

Spark For Plain Old Java Geeks (June2014 Meetup)

DESCRIPTION

An overview of the Apache Spark project from the perspective of a Java programmer. Topics: What is Spark, Spark Programming Model, Spark eco-system, 1.0 release and why it's a huge milestone.

Page 1: Spark For Plain Old Java Geeks (June2014 Meetup)

© Copyright 2013 Pivotal. All rights reserved.

Intro to Apache Spark A primer for POJGs (Plain Old Java Geeks)

Scott Deeg: Sr. Field Engineer [email protected]

Page 2: Spark For Plain Old Java Geeks (June2014 Meetup)

Agenda
• Intro: Agenda, it’s all about ME!, 10 seconds on Pivotal
• What is Spark, and what does it have to do with Big Data/Hadoop?
  – Ecosystem (Shark, Streaming, MLlib, GraphX)
• Spark Programming Model
  – Demo: interactive shell
• Related Projects
• Spark 1.0
• More Tech: WordCount, TicTacToe (dev experience), Java 8
• Deployment Topologies
  – Simple Cluster Demo

Page 3: Spark For Plain Old Java Geeks (June2014 Meetup)

Who Am I?

Just a Plain Old Java Guy
• Java since 1996, Symantec Visual Café 1.0
• Random consulting around Silicon Valley
• Hacker on a Java-based BPM product for 10 years
• Joined VMware in 2009 when it acquired SpringSource
• Rolled into Pivotal April 1, 2013

Page 4: Spark For Plain Old Java Geeks (June2014 Meetup)

What is Pivotal?
• Cloud, Big Data, Fast Data, Modern Apps
• Technology Bets
  – HDFS will be the way we talk to Enterprise data repositories
    ▪ Consolidate silos in a “Data Lake”
    ▪ Ecosystem of services will arise to utilize HDFS data
  – PaaS will manage the Application Life Cycle
  – OSS will be the basis for solutions
  – Cloud Architecture
    ▪ Distributed / Parallel
    ▪ CPU, Memory, Network … storage is a distributed service

Page 5: Spark For Plain Old Java Geeks (June2014 Meetup)

[Diagram: Pivotal Platform. Labels: Data Sources; Application Platform; Stream Server; IMDG; ASF Services; MPP SQL; HDFS; SQL, Objects, JSON; GemFireXD; ...ETC; End Users, Developers, AppOps]

Page 6: Spark For Plain Old Java Geeks (June2014 Meetup)

What Is Spark? Hint: It’s all about the RDD

Page 7: Spark For Plain Old Java Geeks (June2014 Meetup)

?
• Is it “Big Data”?
• Is it “Hadoop”?
• It’s one of those “in memory” things, right?
• JVM, Java, Scala
• Is it real, or just another shiny technology with a long, but ultimately small, tail?

Page 8: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark is …
• Distributed/Cluster Compute Execution Engine
  – Came out of the AMPLab project at UCB, now an ASF top level project
• Designed to work with data in memory
• Scalability and fault tolerance similar to Hadoop Map/Reduce
  – Uses lineage to recompute lost data instead of replicating it
• Generalization of Map/Reduce
  – Implementation of Resilient Distributed Dataset (RDD)
• Programmatic or Interactive
• Written in Scala

Page 9: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark is also …
• An ASF Top Level project
• Has ~100 contributors across 25 companies
  – More active than Hadoop MapReduce
• An ecosystem of domain specific tools
  – Different models, but mostly interoperable
• Hadoop Compatible

Page 10: Spark For Plain Old Java Geeks (June2014 Meetup)

Berkeley Data Analytics Stack (BDAS)

Support
• Batch
• Streaming
• Interactive

Make it easy to compose them

Page 11: Spark For Plain Old Java Geeks (June2014 Meetup)

Short History
• 2009: Started as research project at UCB
• 2010: Open Sourced
• January 2011: AMPLab Created
• October 2012: 0.6
  – Java, stand-alone cluster, Maven
• June 21, 2013: Spark accepted into ASF Incubator
• Feb 27, 2014: Spark becomes top level ASF project
• May 30, 2014: Spark 1.0

Page 12: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark Philosophy
• Make life easy and productive for Data Scientists
• Provide well documented and expressive APIs
• Powerful Domain Specific Libraries
• Easy integration with storage systems
• Caching to avoid data movement (performance)
• Well defined releases, stable API

Page 13: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark is not Hadoop, but is compatible
• Often better than Hadoop (Eric Baldeschwieler)
  – M/R is fine for “Data Parallel” work, but awkward for some workloads
  – Low latency dispatch, Iterative, Streaming
• Natively accesses Hadoop data
• Spark is just another YARN job
  – Maintains huge investment in data collection
  – Brings Spark to the Data
• It’s not OR … it’s AND!

Page 14: Spark For Plain Old Java Geeks (June2014 Meetup)

Improvements over Map/Reduce
• Efficiency
  – General Execution Graphs (not just map->reduce->store)
  – In memory
• Usability
  – Rich APIs in Scala, Java, Python
  – Interactive
• Can Spark be the R for Big Data?

Page 15: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark Programming Model RDDs in Detail

Page 16: Spark For Plain Old Java Geeks (June2014 Meetup)

Core Concept

Think of a program as a set of transformations on a Distributed Dataset

Model: Resilient Distributed Dataset (RDD)
  – Read-only collection of objects spread across a cluster
  – RDDs are built through parallel transformations (map, filter, etc.)
  – Automatically rebuilt on failure using lineage
  – Controllable persistence (RAM, HDFS, etc.)

Page 17: Spark For Plain Old Java Geeks (June2014 Meetup)

Operations
• Create
  – From stable storage (HDFS)
• Transform
  – Generate an RDD from another RDD (map, filter, groupBy)
  – Lazy operations that build a DAG
  – Once Spark knows your transformations it can build an efficient plan
• Action
  – Return a result or write to storage (count, collect, reduce, save)
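The create / transform / action split can be tried out without a cluster. The sketch below is plain Java using java.util.stream, not the Spark API. It is only an analogy, assuming that a stream's intermediate operations stand in for lazy transformations and its terminal operation stands in for an action. The names LazyPipelineSketch, errorsOnly and collect are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipelineSketch {
    // Records every element the pipeline actually reads, to make laziness visible.
    static final List<String> touched = new ArrayList<>();

    // "Transform": only describes the work; no element is read here.
    static Stream<String> errorsOnly(List<String> lines) {
        return lines.stream()
                .peek(touched::add)
                .filter(line -> line.contains("ERROR"));
    }

    // "Action": a terminal operation that forces the pipeline to run.
    static List<String> collect(Stream<String> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("INFO start", "ERROR disk full", "INFO stop");
        Stream<String> pipeline = errorsOnly(log);
        System.out.println("read before action: " + touched.size());
        System.out.println("errors: " + collect(pipeline));
        System.out.println("read after action: " + touched.size());
    }
}
```

Nothing is read from the list until collect runs, which is the same reason Spark can look at the whole chain of transformations before deciding how to execute it.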

Page 18: Spark For Plain Old Java Geeks (June2014 Meetup)

Demo: Log Mining
• Scala shell
• Load file from HDFS
• Search for patterns

Page 19: Spark For Plain Old Java Geeks (June2014 Meetup)

Transformations and Actions
• Transformations
  – map
  – filter
  – flatMap
  – sample
  – groupByKey
  – reduceByKey
  – union
  – join
  – sort
• Actions
  – count
  – collect
  – reduce
  – lookup
  – save

Page 20: Spark For Plain Old Java Geeks (June2014 Meetup)

RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions

  cachedMsgs = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
                            .cache()

  Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
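To make the lineage idea concrete for Java readers, here is a toy model in plain Java. It is not how Spark is implemented and uses none of its classes. It assumes only that each dataset remembers its parent and the function that derives it, so a lost partition can be rebuilt by re-running that function. All names (LineageSketch, Dataset, fromStorage, lose) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class LineageSketch {
    /** A node in the lineage graph: stored data, or a function applied to a parent. */
    static abstract class Dataset<T> {
        private final Map<Integer, List<T>> cache = new HashMap<>();
        int computeCalls = 0;

        abstract List<T> compute(int p);   // derive partition p from the parent
        abstract String lineage();

        List<T> get(int p) {
            List<T> data = cache.get(p);
            if (data == null) {            // never built, or lost: rebuild from lineage
                computeCalls++;
                data = compute(p);
                cache.put(p, data);
            }
            return data;
        }

        void lose(int p) { cache.remove(p); }  // simulate losing a cached partition

        Dataset<T> filter(Predicate<T> keep) {
            Dataset<T> parent = this;
            return new Dataset<T>() {
                List<T> compute(int p) {
                    List<T> out = new ArrayList<>();
                    for (T t : parent.get(p)) if (keep.test(t)) out.add(t);
                    return out;
                }
                String lineage() { return parent.lineage() + " -> filter"; }
            };
        }

        <R> Dataset<R> map(Function<T, R> f) {
            Dataset<T> parent = this;
            return new Dataset<R>() {
                List<R> compute(int p) {
                    List<R> out = new ArrayList<>();
                    for (T t : parent.get(p)) out.add(f.apply(t));
                    return out;
                }
                String lineage() { return parent.lineage() + " -> map"; }
            };
        }
    }

    /** Root of the lineage: partitions that already exist in stable storage. */
    static <T> Dataset<T> fromStorage(List<List<T>> stored) {
        return new Dataset<T>() {
            List<T> compute(int p) { return stored.get(p); }
            String lineage() { return "storage"; }
        };
    }

    public static void main(String[] args) {
        List<List<String>> stored = Arrays.asList(
                Arrays.asList("error\tdb\ttimeout", "info\tweb\tok"),
                Arrays.asList("error\tweb\trefused"));
        Dataset<String> lines = fromStorage(stored);
        Dataset<String> msgs = lines.filter(l -> l.contains("error")).map(l -> l.split("\t")[2]);

        System.out.println(msgs.lineage());
        System.out.println(msgs.get(1));
        msgs.lose(1);                       // the cached partition disappears
        System.out.println(msgs.get(1));    // recomputed from the parent, not restored from a replica
    }
}
```

The point of the design is that only the recipe is kept, not a second copy of the data, which is the trade the fault tolerance slide describes.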

Page 21: Spark For Plain Old Java Geeks (June2014 Meetup)

RDDs are Foundational
• General purpose enough to implement other programming models
  – SQL
  – Graph
  – ML
  – MR

Page 22: Spark For Plain Old Java Geeks (June2014 Meetup)

Related Projects Things that run on Spark

Page 23: Spark For Plain Old Java Geeks (June2014 Meetup)

Related Projects
• Shark
• Spark SQL
• Spark Streaming
• GraphX
• MLbase
• Others

Page 24: Spark For Plain Old Java Geeks (June2014 Meetup)

Shark
• Hive on Spark
  – HiveQL, UDFs, etc.
• Turn SQL into RDDs
  – Part of the lineage
• Based on Hive, but takes advantage of Spark for
  – Fast scheduling
  – Queries are DAGs of jobs, not chained M/R
  – Fast broadcast variables

© Apache Software Foundation

Page 25: Spark For Plain Old Java Geeks (June2014 Meetup)

Shark (cont)
• Optimized columnar storage format
• Fast/Efficient Compression
  – From Yahoo!
  – Able to hold 3-20x more data in the same cluster
• Various other optimizations using partitioning
• Will ultimately run on Spark SQL
  – No Hive dependencies except for accessing the Hive datastore
  – Long running process with management tools

Page 26: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark SQL
• Lib in Spark Core to treat RDDs as relations
  – SchemaRDD
• Lighter weight version of Shark
  – No code from Hive
• Import/Export in different storage formats
  – Parquet, learn schema from existing Hive warehouse
• Takes columnar storage from Shark

Page 27: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark SQL Code
• Go take a look

Page 28: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark Streaming
• Extend Spark to do large scale stream processing
  – 100s of nodes and second-scale end-to-end latency
• Stateful Processing
  – Hard to make fault tolerant (FT)
  – Storm: requires idempotent updates
• Simple, batch-like API with RDDs
• Single semantics for both real time and high latency

Page 29: Spark For Plain Old Java Geeks (June2014 Meetup)

Streaming (cont)
• Input is broken up into batches that become RDDs
• RDDs are composed into DAGs to generate output
• Raw data is replicated in memory for FT
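The "input is broken up into batches" step can be sketched in plain Java. This is an illustration of micro-batching in general, not Spark Streaming code. It assumes non-negative millisecond timestamps and a fixed batch interval, and the names MicroBatchSketch, Event, toBatches and errorsPerBatch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {
    /** A timestamped input record (hypothetical type for this sketch). */
    static final class Event {
        final long timeMillis;
        final String payload;

        Event(long timeMillis, String payload) {
            this.timeMillis = timeMillis;
            this.payload = payload;
        }
    }

    /** Cuts the stream into consecutive fixed-length batches, keyed by batch start time. */
    static Map<Long, List<String>> toBatches(List<Event> events, long batchMillis) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timeMillis / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    /** The per-batch work is ordinary batch code, run once for every interval. */
    static Map<Long, Integer> errorsPerBatch(Map<Long, List<String>> batches) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, List<String>> batch : batches.entrySet()) {
            int errors = 0;
            for (String line : batch.getValue()) {
                if (line.contains("error")) errors++;
            }
            counts.put(batch.getKey(), errors);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(0, "error a"), new Event(400, "ok"),
                new Event(1000, "error b"), new Event(1999, "error c"),
                new Event(2500, "ok"));
        System.out.println(errorsPerBatch(toBatches(events, 1000)));
    }
}
```

Because each batch is just a small collection, the same filter-and-count logic works for a one-second slice and for a whole day of logs, which is the "single semantics" claim made on the previous slide.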

Page 30: Spark For Plain Old Java Geeks (June2014 Meetup)

Streaming (cont)
• Other features
  – Window-based transformations
  – Arbitrary join of streams
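A window-based transformation combines the results of several consecutive batches. The plain-Java sketch below shows one common form, a sliding sum over per-batch counts. It is not the Spark Streaming API. It assumes the window length is given as a number of batches and that the window slides one batch at a time, and the names WindowSketch and windowedSums are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowSketch {
    /** Sums each run of windowBatches consecutive per-batch counts, sliding by one batch. */
    static List<Integer> windowedSums(List<Integer> batchCounts, int windowBatches) {
        if (windowBatches <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        List<Integer> sums = new ArrayList<>();
        int running = 0;
        for (int i = 0; i < batchCounts.size(); i++) {
            running += batchCounts.get(i);                    // newest batch enters the window
            if (i >= windowBatches) {
                running -= batchCounts.get(i - windowBatches); // oldest batch leaves the window
            }
            if (i >= windowBatches - 1) {
                sums.add(running);                            // emit once the window is full
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Four batches of counts, window of two batches.
        System.out.println(windowedSums(Arrays.asList(1, 2, 3, 4), 2)); // prints [3, 5, 7]
    }
}
```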

Page 31: Spark For Plain Old Java Geeks (June2014 Meetup)

GraphX (Alpha)
• Graph processing
  – Replaces Spark Bagel
• Graph Parallel not Data Parallel
  – Reason in the context of neighbors
  – GraphLab API

Page 32: Spark For Plain Old Java Geeks (June2014 Meetup)

GraphX (cont)
• Predicting things about people (e.g. political bias)
  – Look at posts, apply classifier, try to predict attribute
  – Local signal alone is difficult to use
  – Look at context of social network to improve prediction
• Triangle processing
  – More triangles reveal a stronger community
• Collaborative Filtering
  – Bipartite graph processing
  – What I like, who rated those things, what they like => what I may like
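Triangle processing is easy to picture on a small graph. The sketch below counts triangles in an undirected graph held in memory, in plain Java. It is a single-machine illustration of the idea, not GraphX code, and it assumes vertices are integer ids and each edge is listed once. The names TriangleCountSketch, undirected and countTriangles are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TriangleCountSketch {
    /** Builds an adjacency map in which every edge is stored in both directions. */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    /** Counts each triangle once by only accepting vertex triples with u < v < w. */
    static int countTriangles(Map<Integer, Set<Integer>> adj) {
        int triangles = 0;
        for (int u : adj.keySet()) {
            for (int v : adj.get(u)) {
                if (v <= u) continue;
                for (int w : adj.get(v)) {
                    // w closes the triangle if it is also a neighbor of u
                    if (w > v && adj.get(u).contains(w)) triangles++;
                }
            }
        }
        return triangles;
    }

    public static void main(String[] args) {
        int[][] friendships = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
        System.out.println(countTriangles(undirected(friendships))); // prints 1
    }
}
```

A vertex that sits in many triangles has friends who also know each other, which is the "community" signal the slide refers to.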

Page 33: Spark For Plain Old Java Geeks (June2014 Meetup)

GraphX (cont)
• Graph Creation => Algorithm => Post Processing
  – Existing systems mainly deal with the Algorithm step and are not interactive
  – Unify collection and graph models
• Graphs have
  – Vertices, edges
  – Transformation: reverse, filter, map
  – Joins: graphs and tables
  – Aggregate Neighbors

Page 34: Spark For Plain Old Java Geeks (June2014 Meetup)

MLbase
• Machine Learning toolset
  – Library and higher level abstractions
• General tool is MATLAB
  – Difficult for end users to learn, debug, scale solutions
• Starting with MLlib
  – Low level distributed Machine Learning library
• Many different algorithms
  – Classification, Regression, Collaborative Filtering, etc.

Page 35: Spark For Plain Old Java Geeks (June2014 Meetup)

Others
• Mesos
  – Enable multiple frameworks to share same cluster resources
  – Twitter is largest user: over 6,000 servers
• Tachyon
  – In-memory, fault tolerant file system that exposes HDFS
• Catalyst
  – SQL Query Optimizer

Page 36: Spark For Plain Old Java Geeks (June2014 Meetup)

Spark 1.0

Page 37: Spark For Plain Old Java Geeks (June2014 Meetup)

Release cycle
• 1.0 came out at end of May
• 1.X expected to be current for several years
• Quarterly release cycle
  – 2 mo dev / 1 mo QA
  – Actual release is based on vote
• 1.1 due end of August

Page 38: Spark For Plain Old Java Geeks (June2014 Meetup)

1.0
• API stability in 1.X for all non-Alpha projects
  – Can recompile jobs, but hoping for binary compatibility
  – Internal APIs are marked @DeveloperApi or @Experimental
• Focus: Core Engine, Streaming, MLlib, Spark SQL
• History Server for Spark UI
  – Driving development of instrumentation
• Job Submission Tool
  – Don’t configure Context in code (e.g. master)

Page 39: Spark For Plain Old Java Geeks (June2014 Meetup)

1.0
• Java 8 Lambdas
  – No more writing closures as classes
  – Functions are interfaces
  – Return type sensitive functions
    ▪ mapToPair
• Python improvements
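The Java 8 point is easiest to see side by side. The plain-Java sketch below passes the same function once as an anonymous class and once as a lambda. PairMaker and this mapToPair are hypothetical stand-ins written for the example, not Spark's own types. The only assumption carried over from the slide is that the function type is a single-method interface, which is what lets a lambda be used in its place.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class LambdaSketch {
    /** Hypothetical single-method interface, standing in for a framework's function type. */
    @FunctionalInterface
    interface PairMaker<T, K, V> {
        Map.Entry<K, V> call(T input);
    }

    /** Toy mapToPair-style method: applies f to every element and keeps the pairs. */
    static <T, K, V> List<Map.Entry<K, V>> mapToPair(List<T> items, PairMaker<T, K, V> f) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (T item : items) {
            out.add(f.call(item));
        }
        return out;
    }

    // Before Java 8: the closure has to be written out as a class.
    static List<Map.Entry<String, Integer>> oldStyle(List<String> words) {
        return mapToPair(words, new PairMaker<String, String, Integer>() {
            @Override
            public Map.Entry<String, Integer> call(String word) {
                return new SimpleEntry<String, Integer>(word, 1);
            }
        });
    }

    // Java 8: the same function as a lambda.
    static List<Map.Entry<String, Integer>> lambdaStyle(List<String> words) {
        return mapToPair(words, word -> new SimpleEntry<String, Integer>(word, 1));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java");
        System.out.println(oldStyle(words));
        System.out.println(lambdaStyle(words));
    }
}
```

Giving the pair-producing variant its own method name, rather than overloading map, is one way to keep lambda calls unambiguous, since a lambda carries no declared return type for the compiler to pick an overload with.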

Page 40: Spark For Plain Old Java Geeks (June2014 Meetup)

1.0
• Hadoop security
  – Kerberos, ACL for UI
• Job cancel from UI
• Distributed GC as things go out of scope
  – Good for long-lived services
• Spark SQL

Page 41: Spark For Plain Old Java Geeks (June2014 Meetup)

More Code and Demos WordCount, TicTacToe, Java8

Page 42: Spark For Plain Old Java Geeks (June2014 Meetup)

Code Review: WordCount
• Java API
• Java Code
• More usage of RDDs
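For readers who have not seen WordCount before, here is its shape in plain Java 8 streams. This is not the Spark version reviewed in the talk, and it runs on one machine. It assumes words are separated by whitespace and mirrors the usual three steps: split lines into words, key each word, then sum per key. The names WordCountSketch and wordCount are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class WordCountSketch {
    /** Counts how often each whitespace-separated word occurs across all lines. */
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // one line becomes many words (the flatMap step)
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // blank lines would otherwise contribute an empty word
                .filter(word -> !word.isEmpty())
                // group equal words and count them (the reduce-by-key step)
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(wordCount(lines)); // prints {be=2, not=1, or=1, to=2}
    }
}
```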

Page 43: Spark For Plain Old Java Geeks (June2014 Meetup)

TicTacToe: a developer’s experience
• IDE
• Spring
• Building/Logging
• Debugging

Page 44: Spark For Plain Old Java Geeks (June2014 Meetup)

Demo: Java 8

Lambda Lambda Lambda

Page 45: Spark For Plain Old Java Geeks (June2014 Meetup)

Deployment Topologies

Page 46: Spark For Plain Old Java Geeks (June2014 Meetup)

Topologies
• Local
• Spark Cluster (master/slaves)
• Cluster Resource Managers
  – YARN
  – Mesos
• (PaaS?)

Page 47: Spark For Plain Old Java Geeks (June2014 Meetup)

Demo:
• Start master and slaves
• Show the UI
• Run a Job
• Talk about the History Server

Page 48: Spark For Plain Old Java Geeks (June2014 Meetup)

This And That

Page 49: Spark For Plain Old Java Geeks (June2014 Meetup)

How Real is Spark?
• There is some criticism
  – As expected
  – New project!
• There are many indicators that Spark is heading to success
  – Solid technology
  – Good buzz
  – Significant community

Page 50: Spark For Plain Old Java Geeks (June2014 Meetup)

Next Steps
• Spark website: http://spark.apache.org
  – Lots’O’Goodstuff
• Spark Summit, June 30 / July 1
  – http://spark-summit.org

Page 51: Spark For Plain Old Java Geeks (June2014 Meetup)

A NEW PLATFORM FOR A NEW ERA