Spark Summit 2014 Debriefing David Lauzon Presented at Big Data Montreal #26 on July 8th 2014

BDM26: Spark Summit 2014 Debriefing

Page 1: BDM26: Spark Summit 2014 Debriefing

Spark Summit 2014

Debriefing

David Lauzon

Presented at Big Data Montreal #26 on July 8th 2014

Plan

● Spark Summit 2014 summary

● Tachyon

● BlinkDB

● Databricks Cloud

Disclaimer

I haven’t used Spark yet

I haven’t validated all the info gathered in this presentation

Try it out for yourself :-)

Spark’s Role in the Big Data Ecosystem

Matei Zaharia (CTO, Databricks)

“Spark is now the most active project in the Hadoop ecosystem”

“The goal of Spark is to be a unified platform and standard library for big data apps”

native driver

What’s Next for BDAS?

Mike Franklin (Director, UC Berkeley AMPLab)

Layers

● Application

● Data Processing

● Resource Management

● Data Management

BDAS Summary (1/2)

Spark Core: General-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata.

Shark: Replaces Hive’s MapReduce execution engine with Spark.

Spark Streaming: Competitor to Storm. Accepts input from Kafka, Flume, Twitter, and TCP sockets.

MLlib: Low-level machine learning library running on Spark.

MLbase (in dev): Competitor to Mahout; runs on top of MLlib.

GraphX (in dev): Enables users to interactively build, transform, and reason about graph-structured data at scale.

BDAS Summary (2/2)

BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data.

SparkR (alpha): Runs R on top of Spark.

Tachyon: A reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, the local file system, etc.

Mesos: Cluster resource manager with multi-tenancy.
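
The idea behind BlinkDB's "bounded errors" can be illustrated with a small sampling sketch. The snippet below is not BlinkDB code and does not use its API; it is a minimal pure-Python illustration, assuming a uniform random sample and a normal-approximation confidence interval. The function name `approximate_mean`, the synthetic `durations` data, and all parameter values are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation, z=1.96).
    This is an illustrative assumption, not how BlinkDB is implemented.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


# Hypothetical data set: synthetic "session durations" in seconds.
data_rng = random.Random(42)
durations = [data_rng.expovariate(1 / 30.0) for _ in range(200_000)]

# The exact answer scans every value; the estimate reads only 5% of them.
exact = statistics.fmean(durations)
estimate, margin = approximate_mean(durations, sample_size=10_000)

print(f"exact mean     : {exact:.2f}")
print(f"estimated mean : {estimate:.2f} +/- {margin:.2f}")
```

A larger `sample_size` shrinks the margin at the cost of reading more data, which is the error-versus-response-time trade-off the BlinkDB description refers to.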

Spark and the future of big data applications

Eric Baldeschwieler (Tech Advisor)

Big Data Application Model

Spark’s current (v1.0) challenges

Better job scheduling tools

Increase focus on ETL

R bindings

Extend SparkSQL to run on more data stores

Add more machine learning algorithms

Basics: stability, profiling & debugging, error reporting, logging, etc.

Spark’s current (v1.0) challenges

Better stability

Profiling & debugging

Error reporting

Logging

The Future of Spark

Patrick Wendell (Databricks)

Timeline

and:

● Join optimisations

● MLlib: from 15 to 30 algorithms

● Core internal API for pluggable implementations

The Emergence of the Enterprise Data Hub

Mike Olson (Chief Strategy Officer, Cloudera)

(a vision of the future)

This means that sooner or later ...

Hadoop

MapReduce

Spark meets Genomics: Helping Fight the Big C with the Big D

David Patterson (AMP Lab, UC Berkeley)

SNAP: Scalable Nucleotide Alignment Program

=> A new genome aligner based on Spark that is 10-100X faster and simultaneously more accurate than existing tools based on MapReduce or other algorithms [1]

[1] https://amplab.cs.berkeley.edu/projects/snap/

SNAP helps save a life [1]

A teenager was hospitalized for 5 weeks without a successful diagnosis

He developed brain seizures and was placed in a medically induced coma

With a sample of his spinal fluid and the use of SNAP, a rare infectious bacterium was found

The boy was treated and discharged 4 weeks later

[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/

Databricks Update and Announcing Databricks Cloud

Ion Stoica (CEO, Databricks)

even RedHat Fedora

New: Databricks Cloud Platform

Databricks Platform

Databricks Workspace: Notebooks

Databricks Workspace: Dashboards

Databricks Cloud Demo

The following video extract integrates:

● Databricks Workspace

● Databricks Platform

● Spark Streaming

● Spark SQL

● Spark MLlib

14-minute extract: http://youtu.be/dJQ5lV5Tldw?t=26m57s

Full video: https://www.youtube.com/watch?v=dJQ5lV5Tldw

Databricks Cloud

Great tool for data scientists

Conclusion

Most interesting Spark-related projects:

● SparkSQL

● BlinkDB

● Tachyon

● Databricks Cloud