paco-nathan
Spark Camp @ Strata CA An Intro to Apache Spark with Hands-on Tutorials
Wed Feb 18, 2015 9:00am–5:00pm
strataconf.com/big-data-conference-ca-2015/
Spark Camp @ Strata + Hadoop World
A day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, GraphX, and more…
• overview of use cases and demonstrate writing simple Spark applications
• cover each of the main components of the Spark stack
• a series of technical talks targeted at developers who are new to Spark
• intermixed with the talks will be periods of hands-on lab work
Spark Camp @ Strata + Hadoop World
Strata NY @ NYC, 2014-10-15: ~450 people
Strata EU @ Barcelona, 2014-11-19: ~250 people
Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40701
Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions.
Even if you don’t have a specific question, join in to hear what others are asking!
Apache Spark Advanced Training
Feb 17-19 9:00am-5:00pm strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39399
Sameer Farooqui leads this new 3-day training program offered by Databricks and O’Reilly Media at Strata + Hadoop World events worldwide.
Participants will also receive limited free-tier accounts on Databricks Cloud.
Note: this sold out early, so if you want to attend it at Strata EU, sign up quickly!
Spark Developer Certification
Fri Feb 20, 2015 10:40am-12:40pm
• http://oreilly.com/go/sparkcert
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
• 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less memory constrained
Overall, you need to write Spark apps in practice
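As a flavor of "recognize code that is more parallel, less memory constrained": in Spark, reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every raw pair across the network. A plain-Python sketch of that difference (the partition contents are made up for illustration; Spark would run this distributed):

```python
# Sketch: why reduceByKey beats groupByKey for aggregation.
# Two "partitions" of (word, 1) pairs stand in for an RDD.
partitions = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("spark", 1), ("flink", 1)],
]

# groupByKey-style: every (word, 1) pair crosses the shuffle as-is
shuffled_naive = [pair for part in partitions for pair in part]

# reduceByKey-style: combine locally first; each partition then emits
# at most one record per distinct key into the shuffle
local_combined = []
for part in partitions:
    counts = {}
    for word, n in part:
        counts[word] = counts.get(word, 0) + n
    local_combined.append(list(counts.items()))
shuffled_combined = [pair for part in local_combined for pair in part]

# the final merge after the shuffle is the same in both cases
final = {}
for word, n in shuffled_combined:
    final[word] = final.get(word, 0) + n

print(len(shuffled_naive), len(shuffled_combined))   # 5 4
```

Fewer shuffled records means less network traffic and smaller per-key buffers on the reduce side, which is exactly the kind of trade-off the exam code questions probe.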
Developer Certification: Overview
Even More Apache Spark!
Feb 17-20, 2015
Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am strataconf.com/big-data-conference-ca-2015/public/schedule/detail/39547
As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we want to make Spark accessible to a wider set of users, through new high-level APIs for data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and external packages. Like all work on Spark, these APIs are designed to plug seamlessly into Spark applications, giving users a unified platform for streaming, batch and interactive data processing.
Matei Zaharia – started the Spark project at UC Berkeley, currently CTO of Databricks, Spark VP at Apache, and an assistant professor at MIT
Databricks Spark Talks @ Strata + Hadoop World
Thu Feb 19 10:40am-11:20am strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
Lessons from Running Large Scale Spark Workloads
Reynold Xin, Matei Zaharia
Thu Feb 19 4:00pm–4:40pm strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38518
Spark Streaming - The State of the Union, and Beyond
Tathagata Das
Databricks Spark Talks @ Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38237
Tuning and Debugging in Apache Spark
Patrick Wendell
Fri Feb 20 4:00pm–4:40pm strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38391
Everyday I’m Shuffling - Tips for Writing Better Spark Programs
Vida Ha, Holden Karau
A Brief History
A Brief History: Functional Programming for Big Data
circa late 1990s: explosive growth of e-commerce and machine data implied that workloads could no longer fit on a single computer…
notable firms led the shift to horizontal scale-out on clusters of commodity hardware, especially for machine learning use cases at scale
A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes an Apache top-level project
circa 2002: mitigate risk of large distributed workloads lost due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
A Brief History: MapReduce
MR doesn’t compose well for large applications, and so specialized systems emerged as workarounds
MapReduce: general batch processing
Specialized systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
Developed in 2009 at UC Berkeley AMPLab, then open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model. It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
A Brief History: Spark
A Brief History: Spark
Spark is one of the most active Apache projects ohloh.net/orgs/apache
TL;DR: Sustained Exponential Growth
databricks.com/blog/2015/01/27/big-data-projects-are-hungry-for-simpler-and-more-powerful-tools-survey-validates-apache-spark-is-gaining-developer-traction.html
TL;DR: Spark Survey 2015 by Databricks + Typesafe
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
TL;DR: Smashing The Previous Petabyte Sort Record
oreilly.com/data/free/2014-data-science-salary-survey.csp
TL;DR: Spark Expertise Tops Median Salaries within Big Data
Unifying the Pieces
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
Simple Spark Apps: WordCount
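The "3 lines of Spark" version chains flatMap, map, and reduceByKey over an RDD of text lines. A minimal plain-Python sketch of the same computation, with illustrative input standing in for sc.textFile(...):

```python
# Plain-Python sketch of the classic Spark WordCount, mirroring the
# RDD chain flatMap -> map -> reduceByKey. In Spark this runs
# distributed; here the data is an in-memory stand-in.
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: pair each word with 1, then sum the 1s per word
word_counts = {}
for w in words:
    word_counts[w] = word_counts.get(w, 0) + 1

print(word_counts["to"])   # 2
```

The Java MapReduce equivalent needs a Mapper class, a Reducer class, and job-configuration boilerplate, which is where the 50+ lines go.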
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the
// normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
Data Workflows: Spark SQL
// http://spark.apache.org/docs/latest/streaming-programming-guide.html

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
Data Workflows: Spark Streaming
spark.apache.org/docs/latest/mllib-guide.html

Key Points:
• framework vs. library
• scale, parallelism, sparsity
• building blocks for a long-term approach
MLI: An API for Distributed Machine Learning
Evan Sparks, Ameet Talwalkar, et al.
International Conference on Data Mining (2013)
http://arxiv.org/abs/1310.5426
Data Workflows: MLlib
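On the "sparsity" key point: MLlib represents sparse vectors as a logical size plus parallel arrays of nonzero indices and values, so storage and arithmetic scale with the number of nonzeros rather than the dimensionality. A plain-Python sketch of that representation (MLlib's actual class is pyspark.mllib.linalg.SparseVector; this toy version is for illustration only):

```python
# Toy sparse vector in the MLlib style: (size, indices, values),
# storing only the nonzero entries.
class SparseVector:
    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def __getitem__(self, i):
        # linear scan is fine for a sketch; MLlib uses binary search
        for idx, v in zip(self.indices, self.values):
            if idx == i:
                return v
        return 0.0

    def dot(self, dense):
        # only nonzero entries contribute: cost is O(nnz), not O(size)
        return sum(v * dense[idx] for idx, v in zip(self.indices, self.values))

sv = SparseVector(5, [0, 3], [1.0, 4.0])   # dense form: [1.0, 0, 0, 4.0, 0]
print(sv.dot([1.0, 1.0, 1.0, 1.0, 1.0]))   # 5.0
```

For high-dimensional feature vectors (e.g. bag-of-words), this is what makes training at scale feasible.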
Community Resources
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video + preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do

Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/

Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do