BDM26: Spark Summit 2014 Debriefing


Spark Summit 2014 Debriefing
David Lauzon
Presented at Big Data Montreal #26 on July 8th 2014

Plan
  • Spark Summit 2014 summary
  • Tachyon
  • BlinkDB
  • Databricks Cloud

Disclaimer
  • I haven't used Spark yet
  • I haven't validated all the info gathered in this presentation
  • Try it out for yourself :-)

Spark's Role in the Big Data Ecosystem
Matei Zaharia (CTO, Databricks)

Spark is now the most active project in the Hadoop ecosystem

The goal of Spark is to be a unified platform and standard library for big data apps

Note (native driver): it would be good to mention here that Cassandra now has a native driver for Spark.

What's Next for BDAS?
Mike Franklin (Director, UC Berkeley AMPLab)

Layers
  • Application
  • Data Processing
  • Resource Management
  • Data Management

BDAS Summary (1/2)
  • Spark Core: general-purpose, low-level, low-latency processing engine. Supports the HDFS API, the Amazon S3 API, and Hive metadata.
  • Shark: replaces Hive's execution engine (MapReduce) with Spark.
  • Spark Streaming: competitor to Storm. Inputs from Kafka, Flume, Twitter, TCP sockets.
  • MLlib: low-level machine learning library running on Spark.
  • MLbase (in dev): competitor to Mahout, runs on top of MLlib.
  • GraphX (in dev): enables users to interactively build, transform, and reason about graph-structured data at scale.

BDAS Summary (2/2)
  • BlinkDB (alpha): SQL queries with bounded errors and bounded response times on very large data.
  • SparkR (alpha): run R on top of Spark.
  • Tachyon: a reliable in-memory distributed file system providing an HDFS-compatible API. Can persist data to HDFS, Amazon S3, LocalFS, etc.
  • Mesos: cluster resource manager, multi-tenancy.
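To make the BlinkDB entry above more concrete, here is a minimal sketch of the general idea behind "bounded errors": answer an aggregate query from a random sample and report an error bar alongside the estimate. This is not BlinkDB's implementation or API. The function name `approximate_mean`, the synthetic data, and the normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    Hypothetical helper for illustration only, not part of BlinkDB.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic "large" dataset (assumption: values are roughly normal).
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

exact = statistics.fmean(population)  # full scan, for comparison
estimate, margin = approximate_mean(population, sample_size=2_000)

print(f"exact={exact:.2f}  approx={estimate:.2f} +/- {margin:.2f}")
```

Reading 1% of the data gives an answer with an explicit error bar. A larger sample shrinks the margin at the cost of more work, which is the error versus response-time trade-off the slide describes.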

Spark and the future of big data applications
Eric Baldeschwieler (Tech Advisor)

Big Data Application Model

Spark's current (v1.0) challenges
  • Better job scheduling tools
  • Increase focus on ETL
  • R bindings
  • Extend SparkSQL to run on more data stores
  • Add more machine learning algorithms
  • Basics: stability, profiling & debugging, error reporting, logging, etc.

Spark's current (v1.0) challenges
  • Better stability
  • Profiling & debugging
  • Error reporting
  • Logging

The Future of Spark
Patrick Wendell (Databricks)

Timeline

and:
  • join optimisations
  • MLlib: from 15 to 30 algorithms
  • Core internal API for pluggable implementations

The Emergence of the Enterprise Data Hub
Mike Olson (Chief Strategy Officer, Cloudera)

(a vision of the future)

This means that sooner or later ...

Hadoop MapReduce

Spark meets Genomics: Helping Fight the Big C with the Big D
David Patterson (AMP Lab, UC Berkeley)

SNAP: Scalable Nucleotide Alignment Program
=> A new genome aligner based on Spark that is 10-100X faster and simultaneously more accurate than existing tools based on MapReduce or other algorithms [1]

SNAP helps save a life [1]
  • A teenager was hospitalized for 5 weeks without a successful diagnosis
  • He developed brain seizures and was placed in a medically induced coma
  • With a sample of his spinal fluid and the use of SNAP, a rare infectious bacterium was found
  • The boy was treated and discharged 4 weeks later

Update and Announcing Databricks Cloud
Ion Stoica (CEO, Databricks)

even RedHat Fedora

New: Databricks Cloud Platform

a kind of cloud-hosted IPython Notebook

Databricks Platform

Databricks Workspace: Notebooks

Databricks Workspace: Dashboards

Databricks Cloud Demo
The following video extract integrates:
  • Databricks Workspace
  • Databricks Platform
  • Spark Streaming
  • Spark SQL
  • Spark MLlib

Demo Wikipedia ML Twitter realtime graph

14min extract:

Full video:

Databricks Cloud
Great tool for data scientists

Conclusion
Most interesting Spark-related projects:
  • SparkSQL
  • BlinkDB
  • Tachyon
  • Databricks Cloud