Spark Summit 2014
Debriefing
David Lauzon
Presented at Big Data Montreal #26 on July 8th 2014
Plan
● Spark Summit 2014 summary
● Tachyon
● BlinkDB
● Databricks Cloud
Disclaimer
I haven’t used Spark yet
I haven’t validated all the info gathered in this
presentation
Try it out for yourself :-)
Spark’s Role in the Big
Data Ecosystem
Matei Zaharia (CTO, Databricks)
“Spark is now the most active
project in the Hadoop ecosystem”
“The goal of Spark is to be a unified
platform and standard library for big
data apps”
native driver
What’s Next for BDAS?
Mike Franklin
(Director, UC Berkeley AMPLab)
Layers
● Application
● Data Processing
● Resource Management
● Data Management
BDAS Summary (1/2)
● Spark Core: General-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata.
● Shark: Replaces Hive’s MapReduce execution engine with Spark.
● Spark Streaming: Competitor to Storm. Takes input from Kafka, Flume, Twitter, and TCP sockets.
● MLlib: Low-level machine learning library running on Spark.
● MLbase (in dev): Competitor to Mahout; runs on top of MLlib.
● GraphX (in dev): Lets users interactively build, transform, and reason about graph-structured data at scale.
BDAS Summary (2/2)
● BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data.
● SparkR (alpha): Runs R on top of Spark.
● Tachyon: A reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc.
● Mesos: Cluster resource manager with multi-tenancy.
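To make the BlinkDB idea of "bounded errors" concrete, here is a minimal sketch in plain Python. It answers an aggregate query from a random sample and reports an error bound alongside the estimate. This is not BlinkDB's implementation or API: the function name `approximate_mean`, its parameters, and the synthetic data are assumptions made for illustration only, and the bound is a textbook normal-approximation confidence interval.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper, not part of BlinkDB. Returns (estimate, margin),
    where margin is the half-width of an approximate 95% confidence
    interval under the normal approximation (z = 1.96).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


if __name__ == "__main__":
    # Synthetic data standing in for a very large table column.
    rng = random.Random(42)
    data = [rng.uniform(0, 100) for _ in range(100_000)]
    estimate, margin = approximate_mean(data, sample_size=1_000)
    print(f"approximate mean: {estimate:.2f} +/- {margin:.2f}")
    print(f"exact mean:       {statistics.fmean(data):.2f}")
```

A larger sample gives a tighter bound but takes longer to scan, which is the error versus response time trade-off that the BlinkDB description refers to.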
Spark and the future of
big data applications
Eric Baldeschwieler (Tech Advisor)
Big Data Application Model
Spark’s current (v1.0) challenges
Better job scheduling tools
Increase focus on ETL
R bindings
Extend SparkSQL to run on more data stores
Add more machine learning algorithms
Basics: stability, profiling & debugging, error
reporting, logging, etc.
Spark’s current (v1.0) challenges
Better stability
Profiling & debugging
Error reporting
Logging
The Future of Spark
Patrick Wendell (Databricks)
Timeline
and:
● join optimisations
● MLlib: from 15 to 30 algorithms
● Core internal API for pluggable
implementations
The Emergence of the
Enterprise Data Hub
Mike Olson (Chief Strategy Officer,
Cloudera)
(a vision
of the future)
This means that sooner or later ...
Hadoop
MapReduce
Spark meets Genomics:
Helping Fight the Big C
with the Big D
David Patterson (AMP Lab, UC Berkeley)
SNAP: Scalable Nucleotide
Alignment Program
=> A new genome aligner based on Spark that
is 10-100X faster and simultaneously more
accurate than existing tools based on
MapReduce or other algorithms [1]
[1] https://amplab.cs.berkeley.edu/projects/snap/
SNAP helps save a life [1]
A teenager was hospitalized for 5 weeks
without successful diagnosis
He developed brain seizures and was placed in
a medically induced coma
With a sample of his spinal fluid and the use of
SNAP, a rare infectious bacterium was found
The boy was treated and discharged 4 weeks later
[1] https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/
Databricks Update and
Announcing Databricks
Cloud
Ion Stoica (CEO, Databricks)
even RedHat Fedora
New: Databricks Cloud Platform
Databricks Platform
Databricks Workspace: Notebooks
Databricks Workspace: Dashboards
Databricks Cloud Demo
The following video extract integrates:
● Databricks Workspace
● Databricks Platform
● Spark Streaming
● Spark SQL
● Spark MLlib
Databricks Cloud Demo
14-minute extract: http://youtu.be/dJQ5lV5Tldw?t=26m57s
Full video: https://www.youtube.com/watch?v=dJQ5lV5Tldw
Databricks Cloud
Great tool for data scientists
Conclusion
Most interesting Spark related projects:
● Spark SQL
● BlinkDB
● Tachyon
● Databricks Cloud