Spark or Hadoop: Is it an either-or proposition?
By Slim Baltagi (@SlimBaltagi) sbaltagi@gmail.com
http://www.SparkBigData.com
OR? XOR?
Los Angeles Spark Users Group, March 12, 2015
Your Presenter – Slim Baltagi
• Sr. Big Data Solutions Architect living in Chicago
• Over 17 years of IT and business experience
• Over 4 years of Big Data experience, working on over 12 Hadoop projects
• Speaker at Big Data events
• Creator and maintainer of the Apache Spark Knowledge Base http://www.SparkBigData.com, with over 4,000 categorized Apache Spark web resources
@SlimBaltagi
https://www.linkedin.com/in/slimbaltagi
sbaltagi@gmail.com
Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing nor promoting any product or vendor mentioned in this talk.
Agenda
I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More
Q&A
I. Motivation
1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways
1. News
• Is it Spark vs. OR and Hadoop?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage To Hadoop Will Be Bigger Than Kim And Kanye
• Adios Hadoop, Hola Spark
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop
• Escape From Hadoop
• Spark promises to up-end Hadoop, but in a good way
2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe. http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
Apache Spark Survey 2015 by Typesafe – Quick Snapshot
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica. https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia. http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia. http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013. http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop. https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014. https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014. http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
4. Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015. http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• "Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
2. Typical Big Data Stack
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
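The "assembly code" comparison is easier to feel with code. Below is a plain-Python sketch (illustrative only, not Hadoop or Spark code) of the same word count written twice: once with the explicit map/shuffle/reduce phases that MapReduce forces you to spell out, and once as the single expression that higher-level APIs make possible.

```python
from collections import Counter, defaultdict

docs = ["spark and hadoop", "spark or hadoop", "spark"]

# MapReduce style: explicit map, shuffle and reduce phases.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

def reducer(word, counts):
    return word, sum(counts)

mapped = [kv for doc in docs for kv in mapper(doc)]   # map phase
shuffled = defaultdict(list)                          # shuffle phase
for word, n in mapped:
    shuffled[word].append(n)
counts = dict(reducer(w, ns) for w, ns in shuffled.items())  # reduce phase

# What a higher-level API (Pig, Hive, Spark, ...) lets you say in one line:
concise = Counter(word for doc in docs for word in doc.split())

print(counts)   # {'spark': 3, 'and': 1, 'hadoop': 2, 'or': 1}
```

Both versions compute the same result; the point is how much plumbing the low-level style demands for a trivial job.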
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• "Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop."
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
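To make the RDD idea concrete, here is a toy pure-Python imitation (the class name and methods are invented for illustration; this is not Spark's actual API) of the lazy-transformation / eager-action split that RDD pipelines are built on: transformations only record work, and nothing runs until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # deferred transformation steps

    def map(self, f):                 # lazy: just records the step
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # lazy as well
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: runs the whole pipeline
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

pipeline = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(pipeline.collect())   # [0, 4, 16, 36, 64]
```

In real Spark the same shape holds, with the addition that intermediate results can be cached in cluster memory, which is where the speed-up for iterative workloads comes from.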
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
Hadoop MapReduce vs. Tez vs. Spark
• License: Open Source, Apache 2.0, for all three; MapReduce at version 2.x, Tez at version 0.x, Spark at version 1.x.
• Processing Model: MapReduce: on-disk (disk-based parallelization), batch. Tez: on-disk; batch, interactive. Spark: in-memory and on-disk; batch, interactive, streaming (near real-time).
• Language written in: MapReduce: Java. Tez: Java. Spark: Scala.
• API: MapReduce: [Java, Python, Scala], user-facing. Tez: Java, [ISV/Engine/Tool builder]. Spark: [Scala, Java, Python], user-facing.
• Libraries: MapReduce: none, separate tools. Tez: none. Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX].
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Installation: MapReduce: bound to Hadoop. Tez: bound to Hadoop. Spark: isn't bound to Hadoop.
• Ease of Use: MapReduce: difficult to program, needs abstractions; no interactive mode except Hive/Pig. Tez: difficult to program; no interactive mode except Hive/Pig. Spark: easy to program, no need of abstractions; interactive mode.
• Compatibility: to data types and data sources is the same for all three.
• YARN integration: MapReduce: YARN application. Tez: ground-up YARN application. Spark: moving towards YARN.
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Deployment: MapReduce: YARN. Tez: YARN. Spark: [Standalone, YARN, SIMR, Mesos, …].
• Performance: Spark: good performance when data fits into memory; performance degradation otherwise.
• Security: MapReduce: more features and projects. Tez: more features and projects. Spark: still in its infancy; partial support.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
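The first migration path can be sketched without a cluster. Below, a Hadoop-style mapper and reducer are written once and then driven by a Spark-like flatMap / groupByKey / reduce pipeline, simulated here with Python's standard library (illustrative only; in real Spark you would call these same functions from an RDD pipeline in Java or Scala).

```python
from itertools import groupby

# Hadoop-style mapper and reducer, written once...
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

# ...and reused unchanged in a Spark-like functional pipeline:
# flatMap over lines, group by key, then reduce each group.
lines = ["to be or not to be"]
pairs = sorted(kv for line in lines for kv in mapper(line))
result = {k: v
          for key, grp in groupby(pairs, key=lambda kv: kv[0])
          for k, v in reducer(key, [n for _, n in grp])}
print(result)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The mapper and reducer bodies are untouched; only the driver changes, which is why this migration path needs so little development effort.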
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig; still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
3. Integration
Service / Open Source Tool:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
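Schema inference itself is easy to illustrate outside Spark. The plain-Python sketch below (not Spark SQL's implementation; the type names are chosen for illustration) scans JSON records and derives a field-to-type mapping, which is conceptually the first step Spark SQL performs before letting you run SQL over the data.

```python
import json

records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 29, "city": "Chicago"}',
]

# Infer a flat schema: the union of all fields seen, with each field's
# JSON type (first occurrence wins in this simplified sketch).
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        kind = {str: "string", int: "integer",
                float: "double", bool: "boolean"}.get(type(value), "string")
        schema.setdefault(field, kind)

print(schema)   # {'name': 'string', 'age': 'integer', 'city': 'string'}
```

Real Spark SQL handles nested structures, arrays, and type widening across records; the point here is only that no DDL is needed because the schema falls out of the data.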
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
4. Complementarity: Mesos + YARN References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
• httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS
• httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
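As an illustration of option 4 above, pointing Spark at Amazon S3 is just a URI scheme plus credentials; a hedged sketch (Spark 1.x with Hadoop's s3n filesystem; the bucket name and keys are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Demo"))

// Credentials can also come from core-site.xml or IAM roles; placeholders here
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// No HDFS anywhere: read straight from an S3 bucket
val logs = sc.textFile("s3n://my-bucket/logs/*.gz")   // placeholder bucket/path
println(logs.count())
```

The same pattern applies to MapR-FS or Swift: only the URI scheme and the relevant Hadoop configuration keys change.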
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
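Across all of the deployment options above, the application code stays the same; only the master URL changes. A sketch against the Spark 1.x API (host names and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick ONE master URL; the rest of the application does not change.
val conf = new SparkConf().setAppName("DeploymentDemo")
conf.setMaster("local[*]")                     // 1. Local mode: all cores of one machine
// conf.setMaster("spark://master-host:7077")  // 2. Standalone cluster manager
// conf.setMaster("mesos://mesos-host:5050")   // 3. Apache Mesos
// conf.setMaster("yarn-client")               // On YARN (usually set via spark-submit --master)

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())
```

In practice the master is most often left out of the code entirely and supplied at launch time with spark-submit, which keeps one build artifact portable across all of these environments.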
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
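Because Tachyon exposes a Hadoop-compatible file system, "no code change" in practice means only the URI scheme changes. A hedged sketch (the Tachyon master host, its default port 19998, and the paths are placeholders, and the cluster must be configured with the Tachyon client on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TachyonDemo"))

// Same textFile/saveAsTextFile calls as with HDFS, just a tachyon:// URI
val data = sc.textFile("tachyon://tachyon-master:19998/input")      // placeholder path
data.map(_.toUpperCase)
    .saveAsTextFile("tachyon://tachyon-master:19998/output")        // shared at memory speed
```

Data written this way outlives the Spark application, so a MapReduce job (or another Spark job) can read the output without any serialization to disk-backed HDFS.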
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, so Java code can be nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
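The conciseness claim is easy to see on the canonical example: word count is a handful of lines against the native Scala API (a sketch; input and output paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("WordCount").setMaster("local[*]"))

val counts = sc.textFile("hdfs:///input/text")  // placeholder path
  .flatMap(_.split("\\s+"))                     // lines -> words
  .map(word => (word, 1))                       // word -> (word, 1)
  .reduceByKey(_ + _)                           // sum counts per word

counts.saveAsTextFile("hdfs:///output/wordcounts")
```

The same pipeline can be typed line by line in the interactive shell (spark-shell), which is one of the biggest ergonomic differences from writing a MapReduce job in Java.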
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
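The "mix and match" point looks like this in the Spark 1.2-era API: schema is inferred from JSON, queried in SQL, and the result flows back into the regular RDD API (a sketch; the file path and field names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc         = new SparkContext(new SparkConf().setAppName("SparkSQLDemo"))
val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON documents
val people = sqlContext.jsonFile("hdfs:///data/people.json")  // placeholder path
people.registerTempTable("people")

// Declarative SQL ...
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// ... chained into the imperative API on the same result
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```

In later releases jsonFile was superseded by the DataFrame reader API, but the unified SQL-plus-code model described above is unchanged.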
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag3-spark-streaming
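Spark Streaming applies the same RDD operations to mini batches of live data (the model compared against Storm below). A minimal sketch: counting words over 10-second batches from a socket (host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second mini batches

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)                 // same API as batch Spark
counts.print()

ssc.start()
ssc.awaitTermination()
```

Note the batch interval (here 10 seconds) is the floor on latency, which is exactly the "few seconds" row in the Storm comparison that follows.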
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica httpsdatabrickscomblog20140121spark-and-hadoophtml
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia httpwwwslidesharenetdatabricksspark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 httpvisionclouderacommapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." httpswwwmaprcomproductsapache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." httphortonworkscomhadoopspark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits", October 30th, 2014 httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II Big Data Typical Big Data Stack Hadoop Spark
1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool; the above definition is inadequate
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset httpbigdataandreamostosiname Incomplete but a useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch; scalability; abstractions (see slide on evolution of Programming APIs); User Defined Functions (UDFs); …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing Model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory, on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
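Point 1 above can be made concrete: the map() and reduce() bodies of a MapReduce job carry over almost verbatim, while all the driver boilerplate disappears. A hedged sketch (the mapper/reducer here are hypothetical stand-ins for logic lifted out of an existing job; paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical logic extracted from an existing MapReduce job
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").toSeq.map(w => (w, 1))
def reducer(a: Int, b: Int): Int = a + b

val sc = new SparkContext(new SparkConf().setAppName("MigratedJob"))
sc.textFile("hdfs:///input")         // was: FileInputFormat.addInputPath(...)
  .flatMap(mapper)                   // was: Mapper.map()
  .reduceByKey(reducer)              // was: shuffle + Reducer.reduce()
  .saveAsTextFile("hdfs:///output")  // was: FileOutputFormat.setOutputPath(...)
```

The shuffle that MapReduce performs between map and reduce phases is implied by reduceByKey, so the migrated code keeps the same semantics with far less ceremony.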
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Service categories from the Hadoop ecosystem that integrate with Spark (the original slide shows the open source tools for each category as logos):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without going through the Hadoop API: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
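The InputFormat route can be sketched as follows, modeled loosely on HBaseTest.scala; the table name is a placeholder and the HBase client jars are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: expose an HBase table as an RDD via the Hadoop InputFormat API.
object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table
    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println("rows: " + rows.count())
    sc.stop()
  }
}
```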
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
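With the DataStax connector, the table-as-RDD idea looks roughly like this; the keyspace, table, columns and host are placeholders, assuming a connector version matching Spark 1.2:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: read a Cassandra table as an RDD and write a small RDD back.
object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-example")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    val users = sc.cassandraTable("my_keyspace", "users")   // table exposed as an RDD
    println("users: " + users.count())

    sc.parallelize(Seq(("bob", 42)))
      .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
    sc.stop()
  }
}
```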
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its ability to read and write JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues: https://issues.apache.org/jira/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
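Querying an existing Hive table from Spark SQL can be sketched with the 1.2-era HiveContext; the table and column names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: run HiveQL against the existing Hive metastore from Spark.
object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
    val hiveCtx = new HiveContext(sc)
    // Placeholder table "logs"; the query runs on Spark's engine, not MapReduce.
    val top = hiveCtx.sql(
      "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC LIMIT 10")
    top.collect().foreach(println)
    sc.stop()
  }
}
```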
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
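The receiver-based integration described in the guide can be sketched as follows, assuming Spark 1.2-era APIs; the ZooKeeper address, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: consume a Kafka topic as a DStream and count messages per batch.
object KafkaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-example")
    val ssc  = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches
    // (topic -> number of receiver threads); placeholders throughout
    val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
    messages.map(_._2).count().print()  // _._2 is the message payload
    ssc.start()
    ssc.awaitTermination()
  }
}
```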
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
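Schema inference over JSON can be sketched with the 1.2-era SchemaRDD API; the file path and field names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: point Spark SQL at JSON files and query -- no DDL needed.
object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc     = new SparkContext(new SparkConf().setAppName("json-example"))
    val sqlCtx = new SQLContext(sc)
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // schema inferred automatically
    people.printSchema()                                      // show the inferred schema
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
    sc.stop()
  }
}
```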
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
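A Parquet round trip can be sketched with the same 1.2-era API; the case class, paths and values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class defined at top level so Spark SQL can reflect over it.
case class Person(name: String, age: Int)

// Sketch: write an RDD of case classes to Parquet, read it back, and query it.
object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc     = new SparkContext(new SparkConf().setAppName("parquet-example"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.createSchemaRDD  // implicit RDD[Person] -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("ann", 30), Person("bob", 25)))
    people.saveAsParquetFile("hdfs:///data/people.parquet")

    val loaded = sqlCtx.parquetFile("hdfs:///data/people.parquet")
    loaded.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age >= 30").collect().foreach(println)
    sc.stop()
  }
}
```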
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, with more integration on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/
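Because Tachyon is Hadoop-compatible, pointing Spark at it is mostly a matter of URIs and storage levels. A minimal sketch, assuming a 1.2-era Spark with a Tachyon master at a placeholder address:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: read from a tachyon:// path and persist an RDD off-heap in Tachyon.
object TachyonExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tachyon-example")
      .set("spark.tachyonStore.url", "tachyon://master:19998")  // placeholder master
    val sc = new SparkContext(conf)

    val data = sc.textFile("tachyon://master:19998/data/input.txt")
    data.persist(StorageLevel.OFF_HEAP)  // blocks live in Tachyon, outside the JVM heap
    println("lines: " + data.count())
    sc.stop()
  }
}
```

Storing RDD blocks off-heap this way is what lets multiple Spark applications share cached data, as described above.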
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
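The conciseness claim is easiest to see in the canonical word count, which in the native Scala API fits in a few lines (the paths are placeholders), versus dozens of lines in classic MapReduce Java:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: word count in the native Scala API.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))      // split lines into words
      .map(word => (word, 1))        // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum counts per word
    counts.saveAsTextFile("hdfs:///data/counts")
    sc.stop()
  }
}
```

The Java 8 lambda version of the same pipeline is nearly line-for-line equivalent, which is the point the slide makes.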
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading. http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
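Two ideas behind RDDs are worth spelling out: transformations are lazy (they only record a lineage, which is evaluated when an action runs) and the result can be kept in memory for reuse across actions. Here is a toy plain-Python sketch of that behavior (this is not the real Spark API, just an illustration of the execution model):

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, memoized action."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []      # lineage: recorded transformations
        self._result = None        # filled on first action (the "cache")

    def map(self, f):              # lazy: returns a new "RDD", computes nothing
        return ToyRDD(self._data, self._ops + [lambda xs: [f(x) for x in xs]])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [lambda xs: [x for x in xs if p(x)]])

    def collect(self):             # action: evaluates the lineage once
        if self._result is None:
            out = list(self._data)
            for op in self._ops:
                out = op(out)
            self._result = out
        return self._result

squares = ToyRDD(range(1, 6)).map(lambda x: x * x)
odds = squares.filter(lambda x: x % 2 == 1)
assert odds.collect() == [1, 9, 25]
```

Real RDDs add partitioning, fault recovery by replaying the lineage, and explicit `cache()`/`persist()` control, but the lazy-pipeline shape is the same.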
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
License: Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model: On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near-real-time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java, [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility: Same data types and data sources | Same data types and data sources | Same data types and data sources
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
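To make the migration path concrete, here is a minimal plain-Python sketch (not Spark API code; the sample lines are invented) contrasting the two styles: an explicit mapper/reducer pair with a simulated shuffle, versus the chained-transformation style that Spark's RDD API encourages for the same word count.

```python
from collections import Counter

# MapReduce style: explicit map phase, shuffle, and reduce phase.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

lines = ["spark and hadoop", "spark or hadoop"]

shuffled = {}                      # shuffle: group values by key
for line in lines:
    for word, one in mapper(line):
        shuffled.setdefault(word, []).append(one)
mr_result = dict(reducer(w, cs) for w, cs in shuffled.items())

# Spark style: one chained pipeline. In PySpark this would be roughly
#   sc.textFile(...).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)
spark_style = Counter(w for line in lines for w in line.split())

assert mr_result == dict(spark_style)
```

The reusable pieces are the `mapper` and `reducer` functions themselves; what changes is the framework plumbing around them.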
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
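As a sketch of what the switch looks like in a Hive session (the `sales` table is hypothetical, and the setting requires a Hive build with the Spark engine, which was still in beta at the time):

```sql
-- Run an existing Hive query on the Spark engine instead of MapReduce/Tez;
-- only the first line changes, the query itself is untouched.
set hive.execution.engine=spark;

SELECT category, COUNT(*) AS cnt
FROM sales
GROUP BY category;
```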
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and the open source tools that integrate with Spark:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
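What "integrates natively" buys you is Spark Streaming's micro-batch model: a continuous feed such as a Kafka topic is chopped into small batches, and each batch is processed with ordinary Spark-style batch logic. A toy plain-Python sketch of that model (the event names are invented; this is not the KafkaUtils API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) stream into fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Pretend this is an incoming Kafka feed of user events.
events = ["click", "view", "click", "buy", "view"]

# Per-batch computation, as Spark Streaming would run on each DStream batch.
clicks_per_batch = [sum(1 for e in b if e == "click")
                    for b in micro_batches(events, 2)]
assert clicks_per_batch == [1, 1, 0]
```

In real Spark Streaming the batch boundary is a time interval (e.g. 1 second) rather than a count, but the processing model is the same.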
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
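Conceptually, schema inference scans the JSON records and unions the fields it sees into one schema. A toy plain-Python illustration of the idea (not the actual Spark SQL implementation; the records are made up):

```python
import json

# Two JSON records with slightly different fields, as in a real log.
records = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "age": 25, "city": "LA"}',
]

# Union the fields across all records into field -> type-name.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

assert schema == {"name": "str", "age": "int", "city": "str"}
```

Spark SQL does this at scale (and handles nested structures, arrays, and type widening), which is why no DDL is needed before querying.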
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
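The payoff of a columnar format is column pruning: a query that touches one field reads only that field's values instead of whole rows. A toy plain-Python illustration of row versus columnar layout (the data is invented):

```python
# Row layout: records stored one after another; a scan of "amount"
# still visits every field of every record.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
]
row_scan = [r["amount"] for r in rows]

# Columnar layout (Parquet-style): each field's values are stored
# contiguously, so one column can be read in isolation.
columns = {field: [r[field] for r in rows] for field in rows[0]}
col_scan = columns["amount"]

assert row_scan == col_scan == [10.0, 20.0]
```

On disk, Parquet adds per-column encoding and compression on top of this layout, which is what makes analytical scans cheap.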
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity: Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
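The choice of cluster manager is just a `--master` URL at submit time. Illustrative spark-submit invocations for some of these modes (the host names, ports and application jar below are placeholders, not real endpoints):

```shell
# Local mode: driver and executors in a single JVM, 4 worker threads.
spark-submit --master local[4] --class com.example.MyApp my-app.jar

# Standalone mode: point at a Spark master (hypothetical host/port).
spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar

# Mesos mode (hypothetical Mesos master address).
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp my-app.jar
```

The application code itself is unchanged across all three; only the master URL differs.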
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform, Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop Ecosystem | Spark Ecosystem
File System      | HDFS             | Tachyon
Cluster Manager  | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: Beware of <Hadoop vendor>-tinted goggles! FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• MapReduce – 1st generation: batch
• Tez – 2nd generation: batch, interactive
• Spark – 3rd generation: batch, interactive, near-real time
• Flink – 4th generation: batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: for queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                            | Tez                                  | Spark
License          | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory, on-disk; batch, interactive, streaming (near real-time)
Written in       | Java                                        | Java                                 | Scala
API              | [Java, Python, Scala], user-facing          | Java, [ISV/Engine/Tool builder]      | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                                                               | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same for data types and data sources                                           | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                               | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services integrated with Spark, each with its open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
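The Data << RAM point can be felt in a toy sketch (plain Python, not actual Spark or Tez code): caching the parsed dataset in memory means the expensive parse happens only once, while a disk-oriented engine re-parses the data on every pass.

```python
# Toy illustration of why caching parsed data helps when it fits in memory.
parse_calls = 0

def parse(line):
    """Pretend this is an expensive parse step."""
    global parse_calls
    parse_calls += 1
    return int(line)

raw_lines = ["1", "2", "3", "4"]

# Without caching: every pass over the data re-parses it.
total = sum(parse(l) for l in raw_lines)      # pass 1
maximum = max(parse(l) for l in raw_lines)    # pass 2
calls_without_cache = parse_calls             # 2 passes x 4 lines = 8 parses

# With caching (the idea behind Spark's rdd.cache()): parse once, reuse.
parse_calls = 0
cached = [parse(l) for l in raw_lines]        # materialize once in memory
total = sum(cached)                           # pass 1, no re-parse
maximum = max(cached)                         # pass 2, no re-parse
calls_with_cache = parse_calls                # 4 parses total

print(calls_without_cache, calls_with_cache)  # 8 4
```

The counts make the trade-off concrete: each extra pass over an uncached dataset repeats the parsing work, which is exactly what in-memory caching avoids.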
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component         Hadoop ecosystem   Spark ecosystem
File system       HDFS               Tachyon
Resource manager  YARN               Mesos
Tools             Pig                Spark native API
                  Hive               Spark SQL
                  Mahout             MLlib
                  Storm              Spark Streaming
                  Giraph             GraphX
                  HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria          YARN                            Mesos
Resource sharing  Yes                             Yes
Written in        Java                            C++
Scheduling        Memory only                     CPU and memory
Running tasks     Unix processes                  Linux container groups
Requests          Specific requests and           More generic, but more coding
                  locality preference             for writing frameworks
Maturity          Less mature                     Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                     Storm                     Spark Streaming
Processing model             Record at a time          Mini batches
Latency                      Sub-second                Few seconds
Fault tolerance (every       At least once (may be     Exactly once
record processed)            duplicates)
Batch framework integration  Not available             Core Spark API
Supported languages          Any programming language  Scala, Java, Python
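The "mini batches" model in the table can be sketched in plain Python (a conceptual illustration, not Spark Streaming code): records arriving on a stream are grouped into small batches, and each batch is then processed with ordinary batch logic.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an incoming record stream into mini batches, analogous to
    how Spark Streaming discretizes a stream into small RDD batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

records = [1, 2, 3, 4, 5, 6, 7]

# Record-at-a-time (Storm-style): process each record as it arrives.
per_record = [r * 2 for r in records]

# Mini-batch (Spark Streaming-style): the same logic, batch by batch.
batched = [[r * 2 for r in batch] for batch in micro_batches(records, 3)]

print(batched)  # [[2, 4, 6], [8, 10, 12], [14]]
```

Both paths compute the same results; the difference the table captures is latency (a record waits for its batch to fill) versus the simplicity of reusing batch logic and the Core Spark API.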
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
  • With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
  • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

MapReduce (1st generation):  Batch
Tez (2nd generation):        Batch, Interactive
Spark (3rd generation):      Batch, Interactive, Near-Real-time
Flink (4th generation):      Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
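The RDD idea above can be sketched in a few lines of plain Python (the `MiniRDD` name and implementation are made up for illustration; this is not Spark's actual API): transformations are recorded lazily as a lineage, and only an action triggers computation, which is what lets Spark pipeline work in memory.

```python
class MiniRDD:
    """Toy stand-in for a resilient distributed dataset: transformations
    (map, filter) are lazy; an action (collect) triggers computation."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded lineage of transformations

    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        out = list(self._data)
        for kind, fn in self._ops:     # replay the lineage only now
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

# Nothing is computed until collect() is called.
rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

The recorded lineage is also the seed of the "resilient" part: a lost partition can be recomputed by replaying the transformations from the source data.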
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                   Tez                        Spark
License           Open source, Apache 2.0,    Open source, Apache 2.0,   Open source, Apache 2.0,
                  version 2.x                 version 0.x                version 1.x
Processing model  On-disk (disk-based         On-disk; batch,            In-memory and on-disk; batch,
                  parallelization); batch     interactive                interactive, streaming
                                                                         (near real-time)
Written in        Java                        Java                       Scala
API               [Java, Python, Scala],      Java, [ISV/engine/tool     [Scala, Java, Python],
                  user-facing                 builder]                   user-facing
Libraries         None, separate tools        None                       Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                     Tez                        Spark
Installation      Bound to Hadoop               Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program, needs   Difficult to program;      Easy to program, no need
                  abstractions; no interactive  no interactive mode        of abstractions;
                  mode except Hive, Pig         except Hive, Pig           interactive mode
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application              Ground-up YARN             Spark is moving
                                                application                towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria     MapReduce               Tez                     Spark
Deployment   YARN                    YARN                    [Standalone, YARN, SIMR, Mesos, …]
Performance  –                       –                       Good performance when data fits into
                                                             memory; performance degradation otherwise
Security     More features and       More features and       Still in its infancy; partial support
             projects                projects
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
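Point 1 can be sketched in plain Python (a conceptual stand-in for calling existing Java/Scala mapper and reducer functions from Spark; the driver code here is illustrative, not Spark's API): the per-record functions are reused unchanged inside flatMap/groupByKey-style calls.

```python
from collections import defaultdict

# Existing MapReduce-era functions, reused unchanged in the new driver.
def mapper(record):
    """Classic word-count map: emit (word, 1) pairs."""
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    """Classic word-count reduce: sum the counts per key."""
    return (key, sum(values))

# Spark-style driver: same functions, called from a short pipeline.
records = ["spark with hadoop", "spark without hadoop"]
pairs = [kv for rec in records for kv in mapper(rec)]   # ~ flatMap(mapper)
grouped = defaultdict(list)
for k, v in pairs:                                      # ~ groupByKey()
    grouped[k].append(v)
counts = dict(reducer(k, vs) for k, vs in grouped.items())

print(counts["spark"], counts["hadoop"])  # 2 2
```

The migration cost is confined to the driver: the map and reduce logic, often the bulk of a tested MapReduce job, carries over as-is.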
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "–x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
bull Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support/
bull Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
bull The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
bull Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
bull Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
bull Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
bull Integration of Mahout and Spark:
bull Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
bull Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
bull Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration: Service / Open Source Tool
bull Storage/Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
bull Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
bull Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
bull Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
bull The Cassandra storage backend with Spark is opening many new avenues
bull Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
bull MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
bull MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
bull There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
bull Using MongoDB with Hadoop & Spark:
bull Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
bull Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
bull Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
bull Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
bull Neo4j is a highly scalable, robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
bull Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
bull Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
bull YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the first resource negotiator)
bull Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
bull Some issues are critical ones
bull Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables:
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
bull Drill and Spark integration is work in progress in 2015, to address new use cases:
bull Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
bull Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
bull Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
bull 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
bull Spark Streaming integrates natively with Flume. There are two approaches to this:
bull Approach 1: Flume-style Push-based Approach
bull Approach 2 (Experimental): Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
bull Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL, February 2 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
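The schema inference described above can be sketched in plain Python: a toy, stdlib-only illustration of the idea behind Spark SQL's JSON loading (real Spark SQL also handles nested structures and type widening):

```python
import json

def infer_schema(json_lines):
    # Union field -> type across records, the way Spark SQL derives a
    # schema by scanning a JSON dataset. Toy version: flat records only,
    # no nested structs, no type widening.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',  # a new field simply widens the schema
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```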
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
bull This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
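Why a columnar layout matters can be shown with a stdlib-only Python sketch; this illustrates only the row-vs-column idea, not the actual Parquet encoding, which adds row groups, per-column encodings and compression:

```python
# Row layout vs column layout: the reason a columnar format such as
# Parquet lets a query touch only the columns it needs.
rows = [
    {"user": "alice", "age": 34, "country": "US"},
    {"user": "bob",   "age": 28, "country": "FR"},
    {"user": "carol", "age": 41, "country": "US"},
]

# "Write" to columnar form: one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A scan like SELECT avg(age) now reads a single column, not every row.
avg_age = sum(columns["age"]) / len(columns["age"])
print(round(avg_age, 2))  # 34.33
```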
58
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
bull Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
bull Problem:
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
bull Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
bull Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
bull A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter: http://vimeo.com/83192197
bull Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity +
bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
bull "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + : References
bull Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
bull Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
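The Data << RAM point, parse once and reuse from memory, can be illustrated with a small stdlib-only Python sketch, where plain lists stand in for RDDs and rdd.cache():

```python
# Data << RAM: Spark wins by parsing once, caching the result in memory,
# and reusing it across jobs; re-reading and re-parsing the input for
# every job is the MapReduce-style cost.
raw = ["1,2", "3,4", "5,6"]  # pretend this is a file on disk

def parse(lines):
    return [tuple(int(x) for x in line.split(",")) for line in lines]

# Without caching: each "job" re-parses the raw input.
job1 = sum(a for a, b in parse(raw))
job2 = sum(b for a, b in parse(raw))

# With caching: parse once, keep the parsed data in memory, reuse it.
cached = parse(raw)
assert job1 == sum(a for a, b in cached)  # 9, computed with one parse
assert job2 == sum(b for a, b in cached)  # 12, no re-parse needed
```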
70
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
bull Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13 2014 with Matt Schumpert, Director of Product Management at Datameer)
bull The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th 2015 at the Los Angeles Big Data Users Group
bull http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms, February 23 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
bull Framework for the Future of Hadoop, March 9 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS: https://www.quantcast.com/engineering/qfs
bull ...
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
bull Using Spark on a Non-Hadoop distribution
80
Databricks Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud: From raw data to insights and data products in an instant, March 4 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
bull Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
bull DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
bull Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
bull 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
bull xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
bull xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
bull Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
bull The Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Component: Hadoop Ecosystem / Spark Ecosystem
bull HDFS / Tachyon
bull YARN / Mesos
Tools: Hadoop Ecosystem / Spark Ecosystem
bull Pig / Spark native API
bull Hive / Spark SQL
bull Mahout / MLlib
bull Storm / Spark Streaming
bull Giraph / GraphX
bull HUE / Spark Notebook, ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as Data Center "OS":
bull Share the datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
bull 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria: YARN / Mesos
bull Resource sharing: Yes / Yes
bull Written in: Java / C++
bull Scheduling: Memory only / CPU and Memory
bull Running tasks: Unix processes / Linux Container groups
bull Requests: Specific requests and locality preference / More generic, but more coding for writing frameworks
bull Maturity: Less mature / Relatively more mature
91
Spark Native API
bull Spark Native API in Scala, Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for much more concise lambda expressions, getting code nearly as simple as with the Scala API
bull ETL with Spark - First Spark London Meetup, May 28 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
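The shape of the lambda-based Spark API can be sketched over a local Python list (stdlib only; the flatMap/map/reduceByKey names in the comments refer to the Spark operations this mirrors):

```python
from collections import Counter
from functools import reduce

# The classic Spark word count
#   sc.textFile(...).flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)
# expressed over a local list, to show the shape of the lambda-based API.
lines = ["to be or not to be", "to spark"]

flat_mapped = [w for line in lines for w in line.split()]  # flatMap
mapped = [(w, 1) for w in flat_mapped]                     # map
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                mapped, Counter())                         # reduceByKey
print(counts["to"], counts["be"])  # 3 2
```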
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
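The "mix SQL with imperative code" workflow can be sketched with the stdlib sqlite3 module standing in for the SQL engine; this is an analogy only, since Spark SQL runs distributed and reads sources such as Hive, JSON and Parquet:

```python
import sqlite3

# Declarative SQL and imperative code over the same data, in miniature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")])

# Declarative step: let the engine filter...
errors = conn.execute("SELECT msg FROM logs WHERE level = 'ERROR'").fetchall()

# ...then continue imperatively on the result set.
shouting = [msg.upper() for (msg,) in errors]
print(shouting)  # ['DISK FULL', 'TIMEOUT']
```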
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria: Storm / Spark Streaming
bull Processing model: Record at a time / Mini batches
bull Latency: Sub-second / Few seconds
bull Fault tolerance (every record processed): At least once (may be duplicates) / Exactly once
bull Batch framework integration: Not available / Core Spark API
bull Supported languages: Any programming language / Scala, Java, Python
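The record-at-a-time vs mini-batch distinction in the table can be sketched in stdlib-only Python; this is a toy batching function, whereas Spark Streaming's actual DStreams batch by wall-clock time:

```python
# Spark Streaming slices the incoming stream into small time windows and
# runs ordinary batch code on each slice, which is why its latency is
# "a few seconds" rather than sub-second like record-at-a-time Storm.
stream = [(0.2, 5), (0.7, 3), (1.1, 7), (1.9, 2)]  # (arrival time, value)

def to_batches(records, window=1.0):
    batches = {}
    for t, value in records:
        batches.setdefault(int(t // window), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Each mini-batch is processed with the same code a batch job would use.
print([sum(batch) for batch in to_batches(stream)])  # [8, 9]
```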
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
bull Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
bull ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3 Vendors
bull Spark and Hadoop: Working Together, January 21 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
bull "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
bull "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3 Vendors
bull "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30 2013: http://vision.cloudera.com/mapreduce-spark
bull "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
bull "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
bull MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
bull "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
bull Hortonworks: A shared vision for Apache Spark on Hadoop, October 21 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
bull "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the definition above is itself becoming inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
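For contrast with the Java MapReduce WordCount linked above, here is a hedged sketch of the same job in Spark's Scala API; the input and output paths are placeholders and a Spark 1.x cluster is assumed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/docs")  // any Hadoop-supported path
      .flatMap(_.split("\\s+"))                     // tokenize each line
      .map(word => (word, 1))                       // pair each word with a count of 1
      .reduceByKey(_ + _)                           // sum the counts per word
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

The whole multi-class MapReduce program collapses to a few chained transformations, which is the point of the "assembly code" comparison.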
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License:
  - MapReduce: Open Source, Apache 2.0, version 2.x
  - Tez: Open Source, Apache 2.0, version 0.x
  - Spark: Open Source, Apache 2.0, version 1.x
• Processing Model:
  - MapReduce: On-Disk (disk-based parallelization); Batch
  - Tez: On-Disk; Batch, Interactive
  - Spark: In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time)
• Language written in:
  - MapReduce: Java
  - Tez: Java
  - Spark: Scala
• API:
  - MapReduce: [Java, Python, Scala], user-facing
  - Tez: Java [ISV/Engine/Tool builder]
  - Spark: [Scala, Java, Python], user-facing
• Libraries:
  - MapReduce: None, separate tools
  - Tez: None
  - Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
• Installation:
  - MapReduce: Bound to Hadoop
  - Tez: Bound to Hadoop
  - Spark: Isn't bound to Hadoop
• Ease of Use:
  - MapReduce: Difficult to program, needs abstractions; no interactive mode except via Hive, Pig
  - Tez: Difficult to program; no interactive mode except via Hive, Pig
  - Spark: Easy to program, no need for abstractions; interactive mode
• Compatibility:
  - MapReduce, Tez and Spark: compatibility to data types and data sources is the same
• YARN integration:
  - MapReduce: YARN application
  - Tez: Ground-up YARN application
  - Spark: Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
• Deployment:
  - MapReduce: YARN
  - Tez: YARN
  - Spark: [Standalone, YARN, SIMR, Mesos, …]
• Performance:
  - Spark: good performance when data fits into memory; performance degradation otherwise
• Security:
  - MapReduce: more features and projects
  - Tez: more features and projects
  - Spark: still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open as of Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories and open source tools (shown as logos on the slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
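A hedged sketch of the newAPIHadoopRDD approach, along the lines of the HBaseTest.scala example linked above; the table name is a placeholder and the HBase client jars are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

    // Expose the HBase table as an RDD of (row key, row) pairs
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"Rows in table: ${rdd.count()}")
    sc.stop()
  }
}
```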
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
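A minimal sketch of the Spark Cassandra Connector, assuming the connector jar is on the classpath; the host, keyspace, table and column names are all placeholders:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and filter it with plain Spark code
    val users = sc.cassandraTable("test_keyspace", "users")
    val actives = users.filter(_.getBoolean("active"))

    // Write a Spark RDD back to another Cassandra table
    actives.map(row => (row.getString("id"), row.getString("email")))
      .saveToCassandra("test_keyspace", "active_users", SomeColumns("id", "email"))
    sc.stop()
  }
}
```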
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
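A sketch of querying an existing Hive table from Spark SQL, using the Spark 1.2-era HiveContext API; the table and column names are placeholders, and a Hive metastore is assumed to be configured:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
    val hiveCtx = new HiveContext(sc)  // reads hive-site.xml for metastore config

    // Run a HiveQL query against an existing Hive table (placeholder name)
    val rows = hiveCtx.sql("SELECT category, price FROM sales WHERE price > 100")
    rows.collect().foreach(println)
    sc.stop()
  }
}
```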
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
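A hedged sketch of the receiver-based Kafka stream from the integration guide above (Spark 1.2-era API); the ZooKeeper quorum, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> threads)
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group",
      Map("events" -> 1)).map(_._2)  // keep only the message payload

    // Running word count over each micro-batch
    lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```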
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
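A sketch of the schema inference described above, using the Spark 1.2-era SQLContext API; the input path and the queried field names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
    val sqlCtx = new SQLContext(sc)

    // Infer the schema directly from the JSON files -- no DDL needed
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // placeholder path
    people.printSchema()             // show the inferred schema
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)
    sc.stop()
  }
}
```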
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
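A sketch of the round trip through Parquet with the Spark 1.2-era SchemaRDD API; the paths and table name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
    val sqlCtx = new SQLContext(sc)

    // Write a SchemaRDD out to Parquet, then read it back and query it
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // placeholder input
    people.saveAsParquetFile("hdfs:///data/people.parquet")

    val parquet = sqlCtx.parquetFile("hdfs:///data/people.parquet")
    parquet.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people").collect().foreach(println)
    sc.stop()
  }
}
```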
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
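A sketch of the elasticsearch-hadoop RDD integration described above, assuming the elasticsearch-spark jar is on the classpath; the node address and the "index/type" string are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

object EsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EsExample")
      .set("es.nodes", "localhost")  // placeholder Elasticsearch node
    val sc = new SparkContext(conf)

    // Index a small RDD of documents into Elasticsearch
    val docs = Seq(Map("title" -> "spark",  "views" -> 10),
                   Map("title" -> "hadoop", "views" -> 7))
    sc.makeRDD(docs).saveToEs("demo/articles")  // placeholder index/type

    // Read the documents back as an RDD of (id, document) pairs
    sc.esRDD("demo/articles").collect().foreach(println)
    sc.stop()
  }
}
```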
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• Hue is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem (logos shown on the slide)
65
4. Complementarity: Tachyon + Spark + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: When processing huge data volumes, much bigger than the cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
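As a sketch of how the deployment choice surfaces in practice, the cluster manager is selected with the --master flag of spark-submit (host names, ports, class, and jar names below are placeholders, not from the original deck):

```shell
# Local mode: run on a single machine with 4 worker threads
spark-submit --master "local[4]" --class com.example.MyApp myapp.jar

# Standalone: Spark's own built-in cluster manager
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp myapp.jar

# Hadoop YARN (the only option that actually requires a Hadoop cluster; Spark 1.x syntax)
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar
```

The application code itself is unchanged across all four; only the master URL differs.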
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component   Hadoop Ecosystem   Spark Ecosystem
Storage     HDFS               Tachyon
Resources   YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
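Because Tachyon exposes a Hadoop-compatible file system API, "without any code change" means only the URI scheme differs. A minimal sketch, assuming an existing SparkContext sc and a Tachyon master on its default port (host names and paths are hypothetical):

```scala
// Read from and write to Tachyon exactly as with HDFS; only the scheme changes
val input = sc.textFile("tachyon://tachyon-master:19998/logs/input.txt")
val errors = input.filter(_.contains("ERROR"))
errors.saveAsTextFile("tachyon://tachyon-master:19998/logs/errors")

// Spark 1.x can also keep checkpoints off-heap in Tachyon
sc.setCheckpointDir("tachyon://tachyon-master:19998/checkpoints")
```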
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and Memory
Running tasks     Unix processes                   Linux Container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making the code much more concise - nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
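To illustrate how concise the native API is, here is the canonical word count in the Scala API, with the roughly equivalent Java 8 lambda version in comments. This is an illustrative sketch assuming an existing SparkContext sc; paths are placeholders:

```scala
// Scala API: word count in three transformations
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/counts")

// The Java 8 lambda version (Spark 1.x Java API) is nearly as short:
// JavaPairRDD<String, Integer> counts = jsc.textFile("hdfs:///data/input.txt")
//     .flatMap(line -> Arrays.asList(line.split(" ")))
//     .mapToPair(w -> new Tuple2<>(w, 1))
//     .reduceByKey((a, b) -> a + b);
```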
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                       Storm                      Spark Streaming
Processing model               Record at a time           Mini batches
Latency                        Sub-second                 Few seconds
Fault tolerance (every         At least once (may be      Exactly once
record processed)              duplicates)
Batch framework integration    Not available              Core Spark API
Supported languages            Any programming language   Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, by Brian Hopkins, November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name/ - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list - stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org/
• Hive: http://hive.apache.org/
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org/
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org/
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

Generation   Engine       Workloads
1st          MapReduce    Batch
2nd          Tez          Batch, Interactive
3rd          Spark        Batch, Interactive, Near-Real-Time
4th          Flink        Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org/
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org/
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark': for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org/
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org/) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                  Tez                       Spark
License           Open Source Apache 2.0,    Open Source Apache 2.0,   Open Source Apache 2.0,
                  version 2.x                version 0.x               version 1.x
Processing model  On-disk (disk-based        On-disk; Batch,           In-memory and on-disk; Batch,
                  parallelization); Batch    Interactive               Interactive, Streaming
                                                                       (near real-time)
Written in        Java                       Java                      Scala
API               [Java, Python, Scala],     Java, [ISV/Engine/Tool    [Scala, Java, Python],
                  user-facing                builder]                  user-facing
Libraries         None, separate tools       None                      [Spark Core, Spark Streaming,
                                                                       Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                   Tez                         Spark
Installation      Bound to Hadoop             Bound to Hadoop             Isn't bound to Hadoop
Ease of use       Difficult to program,       Difficult to program;       Easy to program, no need
                  needs abstractions; no      no interactive mode         of abstractions;
                  interactive mode (except    (except Hive, Pig)          interactive mode
                  Hive, Pig)
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application            Ground-up YARN              Spark is moving
                                              application                 towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                Tez                      Spark
Deployment   YARN                     YARN                     [Standalone, YARN, SIMR, Mesos, …]
Performance  -                        -                        Good performance when data fits into
                                                               memory; performance degradation otherwise
Security     More features and        More features and        Still in its infancy; partial support
             projects                 projects
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
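As a sketch of such reuse (the function names and tab-separated record format below are hypothetical, assuming an existing SparkContext sc): the body of a Mapper.map() becomes a function passed to flatMap or map, and the body of a Reducer.reduce() becomes the function passed to reduceByKey.

```scala
// Former mapper body, factored as a plain function
def parseRecord(line: String): Option[(String, Long)] =
  line.split("\t") match {
    case Array(key, value) => Some((key, value.toLong))
    case _                 => None  // skip malformed records
  }

// Former reducer body
def sum(a: Long, b: Long): Long = a + b

val result = sc.textFile("hdfs:///data/records")
  .flatMap(parseRecord)   // map phase
  .reduceByKey(sum)       // shuffle + reduce phase
result.saveAsTextFile("hdfs:///data/sums")
```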
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
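In practice the switch is a single flag on the Pig command line (the script name below is a placeholder):

```shell
# Run an existing Pig script unchanged, with Spark as the execution engine
pig -x spark myscript.pig

# The same script on MapReduce or Tez, for comparison
pig -x mapreduce myscript.pig
pig -x tez myscript.pig
```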
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
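Switching an existing Hive session to the Spark engine is a one-line setting; a sketch of a session (the table and query are hypothetical):

```sql
-- Select Spark as the execution engine for this session
set hive.execution.engine=spark;

-- Existing HiveQL runs unchanged; multi-stage queries benefit most
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- Switch back to MapReduce (or tez) at any time
set hive.execution.engine=mr;
```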
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across these service categories:
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
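Since all of these storage systems are reached through the same Hadoop file system API, a Spark program only varies the URI scheme; a sketch assuming an existing SparkContext sc (hosts, buckets, and paths are placeholders):

```scala
// Same API, different storage back ends
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
val fromLocal = sc.textFile("file:///tmp/events.log")
val fromS3    = sc.textFile("s3n://my-bucket/events.log")  // s3n:// scheme in Spark 1.x

// Results can be written back to any of them the same way
fromHdfs.filter(_.nonEmpty).saveAsTextFile("s3n://my-bucket/cleaned")
```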
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
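The newAPIHadoopRDD route looks roughly like the following (modeled on the HBaseTest.scala example above; the table name is a placeholder and sc is an existing SparkContext):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the Hadoop InputFormat at an HBase table
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Each RDD element is a (row key, row result) pair
val hbaseRdd = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows: ${hbaseRdd.count()}")
```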
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
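With the Spark Cassandra Connector, reading and writing look like the sketch below (keyspace, table, and column names are hypothetical; sc is an existing SparkContext configured with spark.cassandra.connection.host):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD, projecting two columns
val users = sc.cassandraTable("my_keyspace", "users")
  .select("user_id", "email")

// Write any RDD of tuples back to a Cassandra table
sc.parallelize(Seq((1, "a@example.com"), (2, "b@example.com")))
  .saveToCassandra("my_keyspace", "users", SomeColumns("user_id", "email"))
```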
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage back end with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
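A minimal sketch of the Hive round trip from Spark SQL (Spark 1.2-era API; the table and column names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads the existing Hive metastore
val hiveContext = new HiveContext(sc)

// Query existing Hive tables with HiveQL; the result is an RDD of Rows
val topDepts = hiveContext.sql(
  "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept")

// Mix SQL with ordinary RDD transformations
topDepts.map(row => (row.getString(0), row.getLong(1)))
  .filter(_._2 > 100)
  .collect()
  .foreach(println)
```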
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
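The receiver-based integration from the guide above looks roughly like this sketch (Spark 1.x API; the ZooKeeper host, consumer group, and topic names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// A 10-second micro-batch stream reading from Kafka via ZooKeeper
val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc,
    "zk-host:2181",          // ZooKeeper quorum
    "my-consumer-group",     // consumer group id
    Map("events" -> 1))      // topic -> number of receiver threads
  .map(_._2)                 // keep the message value, drop the key

lines.count().print()
ssc.start()
ssc.awaitTermination()
```

Spark 1.3 also introduced an experimental receiver-less "direct" approach (KafkaUtils.createDirectStream) for exactly-once semantics.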
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
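A minimal sketch of the schema-inference workflow (Spark 1.2-era API; the path and field names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records; no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()
people.registerTempTable("people")

// Query the inferred schema with plain SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```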
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3 Integration
• spark-avro is a library for querying Avro data with Spark SQL. It requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than having to choose one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
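The "Data << RAM" advantage is simple to demonstrate outside Spark. This plain-Python sketch (illustrative only, not Spark's implementation) counts how many times records get re-parsed with and without an in-memory cache of the parsed results, which is essentially what caching an RDD avoids:

```python
parse_count = 0

def parse(line):
    """Pretend this is an expensive parse step (e.g. JSON decoding)."""
    global parse_count
    parse_count += 1
    return int(line)

raw = ["1", "2", "3"]

# Without caching: every pass over the data re-parses it (MapReduce-style).
total = sum(parse(l) for l in raw)
maximum = max(parse(l) for l in raw)
print(parse_count)  # 6 -> each record parsed twice

# With caching: parse once, keep parsed results in memory
# (analogous to rdd.map(parse).cache() in Spark).
parse_count = 0
cached = [parse(l) for l in raw]
total, maximum = sum(cached), max(cached)
print(parse_count)  # 3 -> each record parsed once
```

With data that fits in cluster memory, Spark amortizes the parse/load cost across all subsequent passes, which is why iterative and interactive workloads benefit most.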
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
           | Hadoop ecosystem | Spark ecosystem
Component  | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
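The "record at a time" vs "mini-batch" distinction in the table can be sketched in plain Python (illustrative only, not Storm or Spark Streaming code): a mini-batch engine groups the incoming stream into small windows and processes each window as a batch:

```python
from itertools import islice

def stream():
    """Pretend this is an unbounded stream of incoming events."""
    yield from range(10)

def mini_batches(events, batch_size):
    """Group a stream into mini-batches, as Spark Streaming does with
    small time windows (size-based here for simplicity)."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Record-at-a-time (Storm-style): process each event as it arrives.
record_results = [e * 2 for e in stream()]

# Mini-batch (Spark Streaming-style): process whole small batches at once.
batch_results = [sum(b) for b in mini_batches(stream(), 3)]
print(batch_results)  # [3, 12, 21, 9]
```

The batch granularity is exactly the latency trade-off in the table: per-record processing gives sub-second latency, while batching adds a few seconds but inherits the core batch API and exactly-once semantics.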
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
8
3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor -- no new project -- is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
  • With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
  • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II Big Data Typical Big Data Stack Hadoop Spark
1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage!
  • BYOC: Bring Your Own Cluster!
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
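The gap between "assembly code" MapReduce and the higher-level APIs above is easy to feel with a word count. This plain-Python sketch (illustrative, not actual Hadoop or Spark code) contrasts explicit map/shuffle/reduce phases with the one-expression chained style that Pig, Scalding, or Spark's API gives you:

```python
from collections import Counter, defaultdict

lines = ["spark or hadoop", "spark and hadoop"]

# MapReduce-style: explicit map, shuffle, and reduce phases.
mapped = [(word, 1) for line in lines for word in line.split()]   # map
shuffled = defaultdict(list)
for word, one in mapped:                                          # shuffle
    shuffled[word].append(one)
counts_mr = {word: sum(ones) for word, ones in shuffled.items()}  # reduce

# High-level style: a single chained expression over the dataset.
counts_spark = Counter(word for line in lines for word in line.split())

print(counts_mr == counts_spark)  # True
```

Both produce identical counts; the difference is entirely in how much plumbing the programmer has to write, which is the point of the API evolution listed above.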
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
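The RDD execution model can be sketched in miniature. The toy class below (purely illustrative, nothing like Spark's real distributed implementation) shows the two ideas the deck leans on: transformations such as map and filter are lazy and merely recorded, and an action such as collect or count triggers the actual computation over in-memory data:

```python
class ToyRDD:
    """A miniature, single-machine imitation of Spark's RDD API:
    transformations (map/filter) are recorded lazily; actions
    (collect/count) run the recorded pipeline over in-memory data."""

    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = list(ops)

    def map(self, f):
        # Lazy: record the operation, return a new dataset handle.
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):
        # Action: replay the recorded pipeline now.
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

In real Spark, the recorded lineage is also what makes the datasets "resilient": a lost partition can be recomputed from its lineage instead of being replicated.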
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | MapReduce                            | Tez                                  | Spark
License             | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive   | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                 | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing   | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                 | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                            | Tez                                  | Spark
Installation     | Bound to Hadoop                      | Bound to Hadoop                      | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application                     | Ground-up YARN application           | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
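Point 1 above can be sketched in plain Python (illustrative only; in practice this is done with Spark's Java/Scala API): the existing mapper and reducer functions are kept unchanged, and simply called from functional-style operations analogous to flatMap and reduceByKey:

```python
from itertools import groupby

# Existing MapReduce-style functions, reused unchanged.
def mapper(line):
    """Emit (word, 1) pairs, as a Hadoop Mapper would."""
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    """Sum the counts for one key, as a Hadoop Reducer would."""
    return (key, sum(values))

lines = ["spark with hadoop", "spark without hadoop"]

# Spark-style pipeline reusing mapper/reducer, analogous to:
#   lines.flatMap(mapper).groupByKey().map(reducer)
pairs = [kv for line in lines for kv in mapper(line)]          # flatMap
pairs.sort(key=lambda kv: kv[0])                               # shuffle
counts = [reducer(k, [v for _, v in g])                        # reduce
          for k, g in groupby(pairs, key=lambda kv: kv[0])]
print(dict(counts))  # {'hadoop': 2, 'spark': 2, 'with': 1, 'without': 1}
```

The migration effort is mostly in the driver code, not in the per-record business logic.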
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service / Open Source Tool (tool logos in the original slide):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
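The automatic schema inference described above can be illustrated with a toy version in plain Python. Spark SQL's real implementation is far more complete (it merges conflicting types, handles nesting and arrays, and runs distributed), so treat this strictly as a sketch of the idea: scan JSON records and accumulate a field-to-type mapping, so no DDL is needed up front.

```python
import json

def infer_field_type(value):
    # Classify a JSON value with a simple type name, the way a
    # schema-inference pass would label each field.
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"

def infer_schema(json_lines):
    # One JSON document per line (the layout Spark SQL's JSON loader
    # expects); the inferred schema is the union of all fields seen.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, infer_field_type(value))
    return schema
```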
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
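The built-in support above means a Parquet file can be queried with nothing but SQL. A hedged sketch using the Spark 1.2-era data sources syntax (the path and table name are placeholders):

```sql
-- Register a Parquet file as a temporary table in Spark SQL
CREATE TEMPORARY TABLE logs
USING org.apache.spark.sql.parquet
OPTIONS (path "hdfs:///data/logs.parquet");

-- Query it like any relational table
SELECT status, COUNT(*) FROM logs GROUP BY status;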
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Tachyon + Spark + Hadoop
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml.html
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity
3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
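From the application's point of view, the deployment modes above mostly differ in the --master URL handed to spark-submit. A hedged sketch in Spark 1.x-era syntax (hosts, ports, and the application jar are placeholders):

```shell
# Local mode: run with 4 worker threads on one machine
spark-submit --master "local[4]" app.jar

# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.jar

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn-cluster app.jar
```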
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and Memory
• Running tasks: Unix processes | Linux Container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
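The "record at a time" vs "mini batches" distinction in the table can be made concrete with a toy simulation in plain Python: a Storm-style loop hands each record to its handler the moment it arrives, while a Spark Streaming-style loop first collects records into small batches and processes each batch as one unit. This is purely illustrative and is not either system's API.

```python
def record_at_a_time(records, handle):
    # Storm-style: every record is processed individually, as it arrives,
    # which is what gives sub-second latency.
    for record in records:
        handle(record)

def mini_batches(records, batch_size, handle_batch):
    # Spark Streaming-style: records are grouped into micro-batches
    # (in the real system the grouping is by time interval, not count),
    # trading a few seconds of latency for batch-API integration.
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        handle_batch(batch)
```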
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data": http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance": https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing": http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits", October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4 Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! – It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles! FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool. The above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete but a useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on the evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala] User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python] User-Facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility: Compatibility to data types and data sources is the same | same | same
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
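To make the translation concrete, here is a minimal sketch (Spark 1.x Scala API) of the classic MapReduce word count re-expressed as Spark RDD transformations; the HDFS paths are placeholders, not from the original deck:

```scala
// Minimal sketch: MapReduce word count expressed with Spark RDDs.
// Input/output paths are hypothetical placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountOnSpark"))
    val counts = sc.textFile("hdfs:///input/docs")   // was: the mapper's input split
      .flatMap(_.split("\\s+"))                      // was: map() tokenizing each line
      .map(word => (word, 1))                        // was: map() emitting (word, 1)
      .reduceByKey(_ + _)                            // was: the reducer's sum
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

The shuffle that MapReduce performs between map and reduce happens inside `reduceByKey`; the rest of the pipeline stays in one job.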
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
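As a command-line illustration only (it assumes a Pig build with the Spork work applied; the script name is a placeholder), the "-x spark" switch mentioned above looks like:

```shell
# Run an existing Pig script unchanged, but on the Spark execution engine
pig -x spark wordcount.pig

# Compare with the default MapReduce execution of the same script
pig -x mapreduce wordcount.pig
```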
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading on Spark (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout on Spark (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout on Spark (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and open source tools integrating with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
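The first bullet above is visible directly in the RDD API: the same `textFile` call reads from any Hadoop-supported storage, and only the URI scheme changes. A hedged sketch (Spark 1.x; all hosts, ports and paths are placeholders):

```scala
// Sketch: storage-agnostic reads via Hadoop-API URI schemes.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("StorageAgnostic"))
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log") // HDFS
val fromLocal = sc.textFile("file:///tmp/events.log")               // local FS
val fromS3    = sc.textFile("s3n://my-bucket/events.log")           // S3 (needs AWS credentials)

// Downstream code is identical regardless of where the data came from.
val merged = fromHdfs.union(fromLocal).union(fromS3)
merged.saveAsTextFile("hdfs://namenode:8020/out/merged")
```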
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
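A hedged sketch of the DataStax connector API described in the first bullet (1.x-era calls; the keyspace, table and column names are invented for illustration):

```scala
// Sketch: Cassandra tables as Spark RDDs via spark-cassandra-connector.
// Keyspace/table/column names below are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows
val plays = sc.cassandraTable("music", "plays")
val byArtist = plays
  .map(row => (row.getString("artist"), 1))
  .reduceByKey(_ + _)

// Write the aggregated RDD back to another Cassandra table
byArtist.saveToCassandra("music", "plays_by_artist", SomeColumns("artist", "count"))
```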
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
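The three Hive capabilities listed above can be sketched through the Spark 1.2-era HiveContext (table and column names here are placeholders, not from the deck):

```scala
// Sketch: querying and writing Hive tables from Spark SQL (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveIntegration"))
val hiveCtx = new HiveContext(sc)

// Import + query: run HiveQL over an existing Hive table
val top = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC LIMIT 10")
top.collect().foreach(println)

// Write back: materialize a query result as a new Hive table
hiveCtx.sql("CREATE TABLE top_pages AS SELECT page FROM logs GROUP BY page")
```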
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput, distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
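The native integration mentioned above can be sketched with the receiver-based `KafkaUtils.createStream` API available in Spark 1.x (ZooKeeper host, consumer group and topic name are placeholders):

```scala
// Sketch: consuming a Kafka topic with Spark Streaming (Spark 1.x API).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

// (zkQuorum, consumer group, topics -> number of receiver threads)
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
val words = messages
  .map(_._2)                 // drop the Kafka message key, keep the value
  .flatMap(_.split(" "))
words.map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```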
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
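The "no more DDL" point can be sketched with the Spark 1.2-era `jsonFile` call; the file path and field names are placeholders:

```scala
// Sketch: automatic schema inference over JSON (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
val sqlCtx = new SQLContext(sc)

// Point Spark SQL at JSON files; the schema is inferred, no DDL written
val people = sqlCtx.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the inferred SchemaRDD and query it with SQL
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```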
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
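The three bullets above round-trip through two calls in the Spark 1.2 API; a hedged sketch with placeholder paths:

```scala
// Sketch: writing and reading Parquet from Spark SQL (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
val sqlCtx = new SQLContext(sc)

// Write: persist a SchemaRDD as columnar Parquet files
val logs = sqlCtx.jsonFile("hdfs:///data/logs.json")
logs.saveAsParquetFile("hdfs:///data/logs.parquet")

// Read: load the Parquet files back and query them with SQL
val reloaded = sqlCtx.parquetFile("hdfs:///data/logs.parquet")
reloaded.registerTempTable("logs")
sqlCtx.sql("SELECT COUNT(*) FROM logs").collect().foreach(println)
```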
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time, distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
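The elasticsearch-hadoop native integration described above can be sketched as follows (hedged: the node address, index/type name and document fields are invented for illustration):

```scala
// Sketch: RDD <-> Elasticsearch via elasticsearch-hadoop's Spark support.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")  // hypothetical ES node
val sc = new SparkContext(conf)

// Save: any RDD whose elements translate into documents can be indexed
val docs = sc.makeRDD(Seq(Map("title" -> "Spark", "year" -> 2015)))
docs.saveToEs("talks/slides")       // index/type name is a placeholder

// Read: pull the documents back out of Elasticsearch as an RDD
val back = sc.esRDD("talks/slides")
```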
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: Healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
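From the application's point of view, all of these non-HDFS options surface as URI schemes on the same API. A hedged sketch (hosts, ports and bucket names are placeholders; each scheme needs the matching connector and credentials on the classpath):

```scala
// Sketch: Spark addressing non-HDFS storage purely through URI schemes.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SparkWithoutHdfs"))

val s3Data      = sc.textFile("s3n://my-bucket/input")             // Amazon S3
val tachyonData = sc.textFile("tachyon://master:19998/input")      // Tachyon in-memory FS
val swiftData   = sc.textFile("swift://container.provider/input")  // OpenStack Swift

// Identical transformations regardless of the backing store
val total = s3Data.union(tachyonData).union(swiftData).count()
```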
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
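As a concrete sketch, the deployment mode is selected by the master URL passed at launch. These are illustrative command fragments, not from the talk; host names, ports and key-file names are placeholders:

```
spark-shell --master local[4]                # 1. Local, 4 worker threads
spark-shell --master spark://master:7077     # 2. Standalone cluster
spark-shell --master mesos://master:5050     # 3. Apache Mesos
./ec2/spark-ec2 -k mykey -i mykey.pem launch my-cluster   # 4. EC2 launch script
```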
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
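A minimal sketch of the "no code change" claim, assuming a running Tachyon master and a Spark runtime (host, port and paths below are placeholders): only the URI scheme changes.

```python
# Sketch -- requires a Spark runtime plus a Tachyon deployment; not runnable standalone.
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-sketch")
# Same textFile/saveAsTextFile calls as with hdfs://, only the scheme differs
rdd = sc.textFile("tachyon://tachyon-master:19998/logs/events.txt")
rdd.saveAsTextFile("tachyon://tachyon-master:19998/logs/events-out")
```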
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Resource sharing: YARN - yes; Mesos - yes
• Written in: YARN - Java; Mesos - C++
• Scheduling: YARN - memory only; Mesos - CPU and memory
• Running tasks: YARN - Unix processes; Mesos - Linux container groups
• Requests: YARN - specific requests and locality preference; Mesos - more generic, but more coding for writing frameworks
• Maturity: YARN - less mature; Mesos - relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making the Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
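To illustrate the conciseness of the native API, here is the classic word count as a sketch against the Python API (the input path is a placeholder, and a Spark runtime is required):

```python
# Sketch -- run inside a pyspark shell or with spark-submit; path is a placeholder.
from pyspark import SparkContext

sc = SparkContext("local[2]", "wordcount")
counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.take(5))
```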
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL with sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
• Processing model: Storm - record at a time; Spark Streaming - mini batches
• Latency: Storm - sub-second; Spark Streaming - a few seconds
• Fault tolerance (every record processed): Storm - at least once (may be duplicates); Spark Streaming - exactly once
• Batch framework integration: Storm - not available; Spark Streaming - Core Spark API
• Supported languages: Storm - any programming language; Spark Streaming - Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs, Oh My - It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License: MapReduce - Open Source, Apache 2.0, version 2.x; Tez - Open Source, Apache 2.0, version 0.x; Spark - Open Source, Apache 2.0, version 1.x
• Processing model: MapReduce - on-disk (disk-based parallelization), batch; Tez - on-disk, batch, interactive; Spark - in-memory or on-disk, batch, interactive, streaming (near real-time)
• Language written in: MapReduce - Java; Tez - Java; Spark - Scala
• API: MapReduce - [Java, Python, Scala], user-facing; Tez - Java [ISV/engine/tool builder]; Spark - [Scala, Java, Python], user-facing
• Libraries: MapReduce - none, separate tools; Tez - none; Spark - [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
• Installation: MapReduce - bound to Hadoop; Tez - bound to Hadoop; Spark - isn't bound to Hadoop
• Ease of use: MapReduce - difficult to program, needs abstractions, no interactive mode except Hive/Pig; Tez - difficult to program, no interactive mode except Hive/Pig; Spark - easy to program, no need for abstractions, interactive mode
• Compatibility: the same for all three with respect to data types and data sources
• YARN integration: MapReduce - YARN application; Tez - ground-up YARN application; Spark - moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
• Deployment: MapReduce - YARN; Tez - YARN; Spark - [Standalone, YARN, SIMR, Mesos, …]
• Performance: Spark - good performance when data fits into memory, performance degradation otherwise
• Security: MapReduce - more features and projects; Tez - more features and projects; Spark - still in its infancy, partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
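The correspondence behind point 1 can be made concrete with a plain-Python model of the Spark pipeline: the mapper and reducer bodies carry over unchanged, only the glue differs. (This models RDD semantics on a list; in real Spark, `lines` would come from `sc.textFile(...)` and `flat_map`/`reduce_by_key` would be RDD transformations.)

```python
# Plain-Python model of a Spark word count, showing that MapReduce mapper and
# reducer logic can be reused as ordinary functions.

def flat_map(f, xs):
    # models RDD.flatMap on a plain list
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # models RDD.reduceByKey on a plain list of (key, value) pairs
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

# Mapper logic, reused as-is: tokenize and emit (word, 1) pairs
mapper = lambda line: [(w, 1) for w in line.split()]
# Reducer logic, reused as-is: sum counts per key
reducer = lambda a, b: a + b

lines = ["spark and hadoop", "spark with hadoop"]
counts = reduce_by_key(reducer, flat_map(mapper, lines))
print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'with': 1}
```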
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
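A sketch of the migration path just described (the script name is a placeholder): the same Pig script runs unchanged, only the execution-engine flag differs.

```
pig -x mapreduce wordcount.pig   # existing MapReduce execution
pig -x spark     wordcount.pig   # same script on the Spark engine
```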
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
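A sketch of what such a session looks like (table and column names are made-up placeholders; the HiveQL itself stays unchanged):

```
hive> set hive.execution.engine=spark;            -- per-session engine switch
hive> SELECT dept, avg(salary) FROM employees     -- unchanged HiveQL
    > GROUP BY dept;
```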
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Apache Mahout (expected in Mahout 1.0)
• Mahout news, April 25, 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3. Integration
Service categories and open source tools (tool logos omitted from this extraction):
• Storage / serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
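A hedged sketch of the newAPIHadoopRDD route from Python, modeled on the hbase_inputformat.py example bundled with Spark (requires a Spark runtime with the HBase client and Spark-examples converter classes on the classpath; the table name is a placeholder):

```python
# Sketch -- requires Spark plus HBase client and converter classes; not runnable standalone.
from pyspark import SparkContext

sc = SparkContext(appName="hbase-read-sketch")
conf = {"hbase.mapreduce.inputtable": "my_table"}  # placeholder table name
rows = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)
print(rows.count())
```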
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
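For reference, the two Spark 1.x-era YARN submission modes look roughly like this (class name, jar and sizing values are placeholders):

```
# Driver runs inside the YARN cluster:
spark-submit --master yarn-cluster --num-executors 4 --executor-memory 2g \
  --class com.example.MyApp my-app.jar
# Interactive shell: driver runs on the client, executors on YARN:
spark-shell --master yarn-client
```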
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
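A hedged sketch of the bullets above using the Spark 1.2-era HiveContext (requires a Spark build with Hive support; the table and column names are invented for illustration):

```python
# Sketch -- requires a Spark runtime built with Hive support; not runnable standalone.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-sketch")
hc = HiveContext(sc)
# Import relational data from a Hive table and run SQL over it
top_pages = hc.sql(
    "SELECT page, count(*) AS hits FROM weblogs "
    "GROUP BY page ORDER BY hits DESC LIMIT 10")
for row in top_pages.collect():
    print(row)
```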
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
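A hedged sketch of the receiver-based integration (the Python API for this arrives with Spark 1.3; the ZooKeeper address, consumer group and topic name are placeholders):

```python
# Sketch -- requires a Spark 1.3+ runtime and a running Kafka/ZooKeeper; not runnable standalone.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-sketch")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches
# {topic: number of receiver threads}
stream = KafkaUtils.createStream(ssc, "zookeeper:2181", "my-group", {"events": 1})
stream.map(lambda kv: kv[1]).count().pprint()  # messages per batch
ssc.start()
ssc.awaitTermination()
```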
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
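A hedged sketch of the schema-inference workflow with the Spark 1.2-era API (the file path and field names are placeholders; in 1.3 this moves to the DataFrame API):

```python
# Sketch -- requires a Spark runtime; people.json is a placeholder path.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-sketch")
sqlContext = SQLContext(sc)
people = sqlContext.jsonFile("people.json")  # schema inferred automatically, no DDL
people.printSchema()
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
print(adults.collect())
```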
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
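The three bullets can be sketched with the Spark 1.2-era SchemaRDD API (paths are placeholders; a Spark runtime is required):

```python
# Sketch -- requires a Spark runtime; paths are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-sketch")
sqlContext = SQLContext(sc)
events = sqlContext.jsonFile("events.json")      # any SchemaRDD source works
events.saveAsParquetFile("events.parquet")       # write out as Parquet
back = sqlContext.parquetFile("events.parquet")  # read back, schema preserved
back.registerTempTable("events")
print(sqlContext.sql("SELECT count(*) FROM events").collect())
```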
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
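The elasticsearch-hadoop native integration described above can be sketched like this (the index/type name and the document contents are illustrative placeholders; an existing SparkContext `sc` and a reachable Elasticsearch node are assumed):

```scala
import org.elasticsearch.spark._   // adds saveToEs / esRDD to SparkContext and RDDs

// Any RDD whose content can be translated into documents can be saved
val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark and Elasticsearch", "views" -> 10),
  Map("title" -> "RDDs as documents",       "views" -> 3)))
docs.saveToEs("blog/posts")        // index/type are placeholders

// Read an index back as an RDD of (documentId, document) pairs
val posts = sc.esRDD("blog/posts")
println(posts.count())
```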
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
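The storage-agnostic point above shows up directly in the API: the same `textFile` call works across schemes. All paths below are placeholders, and each scheme needs the matching connector jars and credentials configured on your cluster:

```scala
// Spark is file-system agnostic: only the URI scheme changes (2015-era schemes).
val local   = sc.textFile("file:///tmp/events.log")            // local file system
val hdfs    = sc.textFile("hdfs://namenode:8020/events")       // HDFS (optional)
val s3      = sc.textFile("s3n://my-bucket/events/")           // Amazon S3
val tachyon = sc.textFile("tachyon://master:19998/events")     // Tachyon
val swift   = sc.textFile("swift://container.provider/events") // OpenStack Swift
```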
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Component:
HDFS   | Tachyon
YARN   | Mesos
Tools:
Pig    | Spark native API
Hive   | Spark SQL
Mahout | MLlib
Storm  | Spark Streaming
Giraph | GraphX
HUE    | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
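In Spark 1.x, the Tachyon sharing described above is reachable directly from the RDD API via off-heap persistence (sketch; assumes `spark.tachyonStore.url` points at a running Tachyon master, and the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Persist RDD blocks off-heap in Tachyon: they live outside the executor
// JVMs, avoid GC pressure, and can survive executor restarts.
val errors = sc.textFile("hdfs:///logs").filter(_.contains("ERROR"))
errors.persist(StorageLevel.OFF_HEAP)
errors.count()   // first action materializes the blocks into Tachyon
```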
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
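The conciseness claim above is easiest to see on the canonical word count, which takes a page of Java MapReduce but only a few lines of the native Scala API (input/output paths are placeholders; an existing SparkContext `sc` is assumed):

```scala
// Word count with the Spark native Scala API
val counts = sc.textFile("hdfs:///input")
  .flatMap(_.split("\\s+"))      // split lines into words
  .map(word => (word, 1))        // pair each word with a count of 1
  .reduceByKey(_ + _)            // sum counts per word
counts.saveAsTextFile("hdfs:///output")
```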
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
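The "mix and match" point above can be sketched with schema inference over JSON (Spark 1.2-era SchemaRDD API; the file path is a placeholder, and an existing SQLContext `sqlContext` is assumed):

```scala
// Ingest semi-structured data: the schema is inferred from the JSON records
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Declarative SQL ...
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// ... mixed with imperative RDD operations on the result
teens.map(row => "Name: " + row(0)).collect().foreach(println)
```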
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
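The MLlib slide above is image-based; a minimal sketch of what MLlib usage looks like is k-means over a few hand-made 2-D points (all values are made up for illustration; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of made-up points
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// Train a k-means model and inspect the learned centers
val model = KMeans.train(points, 2, 10)   // k = 2, maxIterations = 10
model.clusterCenters.foreach(println)
```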
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
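The Spark Streaming slide above is image-based; its mini-batch model (compared with Storm on the next slide) can be sketched as a word count over 1-second batches from a socket source (host and port are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Mini batches: the stream is discretized into 1-second RDDs
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```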
95
Storm vs Spark Streaming
Criteria                    | Storm                             | Spark Streaming
Processing model            | Record at a time                  | Mini batches
Latency                     | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available                     | Core Spark API
Supported languages         | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
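The GraphX slide above is image-based; a minimal sketch of the API is PageRank over a tiny hand-made graph (vertex labels and edges are made up for illustration; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A three-vertex cycle: a -> b -> c -> a
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)
// Run PageRank until convergence within the given tolerance
graph.pageRank(0.001).vertices.collect().foreach(println)
```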
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A Shared Vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-real-time
• 4th generation: Batch, Interactive, Real-time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                             | Tez                            | Spark
License          | Open source, Apache 2.0, version 2.x  | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                               | Java                           | Scala
API              | [Java, Python, Scala], user-facing    | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                  | None                           | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                           | Tez                              | Spark
Installation     | Bound to Hadoop                     | Bound to Hadoop                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode, except Hive/Pig | Difficult to program; no interactive mode, except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | To data types and data sources: same | To data types and data sources: same | To data types and data sources: same
YARN integration | YARN application                    | Ground-up YARN application       | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark, http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service | Open source tool
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without going through the Hadoop API: the Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend for Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its ability to read and write JSON text files.
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator - an implicit reference to Mesos as the original resource negotiator.
• Integration is still improving. Open SPARK issues mentioning YARN: https://issues.apache.org/jira (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
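The Hive round trip described above, sketched in PySpark (assuming Spark 1.2 built with Hive support, a hive-site.xml on the classpath, and a hypothetical `logs` table):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local[2]", "hive-demo")
sqlContext = HiveContext(sc)  # talks to the Hive metastore

# Import relational data from a Hive table and query it with SQL.
errors = sqlContext.sql("SELECT level, msg FROM logs WHERE level = 'ERROR'")

# The result is an ordinary RDD, so SQL and programmatic APIs mix freely.
print(errors.count())

# Write results back out as a new Hive table.
sqlContext.sql("CREATE TABLE error_logs AS SELECT * FROM logs WHERE level = 'ERROR'")
```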
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming - Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
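A minimal PySpark Streaming sketch of the native Kafka integration (the receiver-based approach; assumes Spark 1.3+, a ZooKeeper at the illustrative address below, and a hypothetical `events` topic):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "kafka-demo")
ssc = StreamingContext(sc, batchDuration=2)  # 2-second micro-batches

# Receiver-based stream: ZooKeeper quorum, consumer group, {topic: partitions}.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", {"events": 1})

# Each record is a (key, value) pair; count events per micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```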
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at your JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
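The schema-inference flow above, sketched in PySpark (Spark 1.2-era API; `jsonFile` became `read.json` in later releases; the path is illustrative):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "json-demo")
sqlContext = SQLContext(sc)

# Infer the schema directly from the JSON files -- no DDL required.
people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()  # shows the inferred field names and types

# Query it like any other table.
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```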
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
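The Parquet round trip, sketched in PySpark (Spark 1.2-era API; `saveAsParquetFile`/`parquetFile` were later folded into the DataFrame reader/writer; paths are illustrative):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "parquet-demo")
sqlContext = SQLContext(sc)

# Write a SchemaRDD out as Parquet; the schema travels with the data.
people = sqlContext.jsonFile("hdfs:///data/people.json")
people.saveAsParquetFile("hdfs:///data/people.parquet")

# Read it back and query; columnar storage means only 'name' is scanned.
parquet_people = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquet_people.registerTempTable("parquet_people")
names = sqlContext.sql("SELECT name FROM parquet_people")
```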
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos - references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffle implementation and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
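Because the cluster manager is pluggable, switching deployments is mostly a matter of the master URL. A PySpark sketch (host names are illustrative; YARN master syntax is Spark 1.x-era):

```python
from pyspark import SparkConf, SparkContext

# Pick one master URL; the application code stays the same.
conf = (SparkConf()
        .setAppName("deployment-demo")
        .setMaster("local[4]"))                  # local mode, 4 worker threads
# .setMaster("spark://master-host:7077")         # standalone cluster
# .setMaster("mesos://mesos-master:5050")        # Apache Mesos
# .setMaster("yarn-client")                      # YARN (Spark 1.x syntax)

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # quick sanity check
```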
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Component            Hadoop ecosystem   Spark ecosystem
Storage              HDFS               Tachyon
Resource management  YARN               Mesos
Tools                Pig                Spark native API
                     Hive               Spark SQL
                     Mahout             MLlib
                     Storm              Spark Streaming
                     Giraph             GraphX
                     HUE                Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
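To see why the native API reads so concisely, here is a tiny pure-Python toy that mimics the shape of the classic RDD word-count chain (not real Spark; `ToyRDD` and its methods are hypothetical stand-ins for the Spark operators of the same names):

```python
class ToyRDD:
    """A minimal, single-machine stand-in for a Spark RDD."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the resulting sequences.
        return ToyRDD(x for item in self.data for x in f(item))

    def mapPairs(self, f):
        # Spark's map(), used here to build (key, value) pairs.
        return ToyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Combine all values that share a key, like Spark's reduceByKey.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return list(self.data)

lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flatMap(lambda line: line.split())
               .mapPairs(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect()))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The real Spark chain looks the same, minus the class definition, which is the point of the "nearly as simple as Scala" claim above.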
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
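A short PySpark sketch of that "mix and match" style (Spark 1.2-era API, where `inferSchema` builds a SchemaRDD; data is inline, so no external files are assumed):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "sparksql-demo")
sqlContext = SQLContext(sc)

# Build a SchemaRDD from an ordinary RDD of Rows.
rows = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=19)])
people = sqlContext.inferSchema(rows)
people.registerTempTable("people")

# Declarative SQL...
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
# ...mixed with imperative RDD operations on the result.
print(adults.map(lambda row: row.name.upper()).collect())
```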
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
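The "record at a time" vs "mini batches" distinction in the comparison above can be illustrated in a few lines of pure Python (a simulation of the two processing models, not real Storm or Spark code):

```python
def process_per_record(events, handle):
    """Storm-style: each record is handled the moment it arrives."""
    for event in events:
        handle([event])            # one invocation per record -> lowest latency

def process_micro_batches(events, handle, batch_size):
    """Spark Streaming-style: records are grouped into small batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)          # one invocation per batch -> higher throughput,
            batch = []             # latency bounded by the batch interval
    if batch:
        handle(batch)              # flush the final partial batch

calls = []
process_micro_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off in the table falls out directly: per-record handling minimizes latency, while batching amortizes per-invocation overhead and lets the batch engine (Core Spark) be reused.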
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs, Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: listen to what Spark developers are saying.
3 Vendors: beware <Hadoop vendor>-tinted goggles. FUD is still being 'offered' by some Hadoop vendors; claims need to be contextualized.
4 Analysts: thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list - stay tuned!
20
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
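To make the "assembly code" point concrete: even the trivial word count forces you to think in explicit map, shuffle and reduce phases. A pure-Python simulation of those three phases (illustrative only, not the Hadoop Java API):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the grouped counts."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["to be or", "not to be"])))
print(counts)
```

Every abstraction listed above (Pig, Hive, Cascading, Crunch, …) exists to hide this phase plumbing behind a higher-level query or collection API.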
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real-time
• 4th generation: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink httpflinkapacheorg offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | MapReduce                                   | Tez                                  | Spark
License             | Open source, Apache 2.0, version 2.x        | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                        | Tez                                                              | Spark
Installation     | Bound to Hadoop                                                                  | Bound to Hadoop                                                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode (except Hive, Pig) | Difficult to program; no interactive mode (except Hive, Pig)     | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same support for data types and data sources                                     | Same support for data types and data sources                     | Same support for data types and data sources
YARN integration | YARN application                                                                 | Ground-up YARN application                                       | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
  1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
  2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
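In practice the switch is just a command-line flag. A sketch, assuming a Pig build with Spork support (the script name is illustrative):

```shell
# Run an existing Pig script on the Spark execution engine (script name is illustrative).
pig -x spark myscript.pig

# Compare with the classic engines:
pig -x mapreduce myscript.pig
pig -x tez myscript.pig
```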
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
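Putting that setting in the context of a Hive session (the table and query below are illustrative, not from the deck):

```sql
-- Switch the execution engine for the current Hive session (Hive on Spark beta).
SET hive.execution.engine=spark;

-- The same HiveQL then runs unchanged on Spark instead of MapReduce or Tez:
SELECT category, COUNT(*) AS n
FROM products
GROUP BY category;
```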
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• "Hive on Spark is blazing fast... or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services and their open source tools integrate with Spark across several layers: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: driving business insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop and Spark:
  • Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
58
3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data web applications for interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad project marries YARN and Apache Mesos resource management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: can't we all just get along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for data pipelines with native YARN integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "smart execution engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre file system: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...
76
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
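In practice, the deployment choice largely reduces to the --master URL passed to spark-submit. A hedged sketch (the cluster addresses and application file are illustrative):

```shell
# Same application, different cluster managers (addresses and paths are illustrative).
spark-submit --master local[4]              app.py   # local threads, no cluster
spark-submit --master spark://master:7077   app.py   # Spark standalone cluster
spark-submit --master mesos://master:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster          app.py   # Hadoop YARN (Spark 1.x syntax)
```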
78
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: ultra-fast data analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: httpwwwstratiocom
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its operational intelligence platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component | Hadoop ecosystem | Spark ecosystem
Storage   | HDFS             | Tachyon
Resources | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, with much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
---------|-------|----------------
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
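The first two rows of the table can be sketched in plain Python (not the Storm or Spark Streaming APIs): the same per-record function is applied either record-at-a-time or over buffered mini-batches.

```python
from typing import Iterable, Iterator, List

def record_at_a_time(stream: Iterable[int]) -> List[int]:
    """Storm-style: handle each record as soon as it arrives."""
    out = []
    for record in stream:
        out.append(record * 2)          # per-record processing
    return out

def mini_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Spark-Streaming-style: buffer records, then process each small batch."""
    batch: List[int] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield [r * 2 for r in batch]   # one batch job over the buffer
            batch = []
    if batch:
        yield [r * 2 for r in batch]

print(record_at_a_time([1, 2, 3, 4, 5]))        # [2, 4, 6, 8, 10]
print(list(mini_batches([1, 2, 3, 4, 5], 2)))   # [[2, 4], [6, 8], [10]]
```

Buffering is why Spark Streaming's latency is "a few seconds" while Storm's is sub-second, and also why each mini-batch can reuse the core Spark batch machinery.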
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: httpbigdataandreamostosiname Incomplete but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
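The gap these abstractions close can be seen in a plain-Python word count: the explicit map/shuffle/reduce phases of "assembly-code" MapReduce versus the one-expression style that Pig, Hive, Scalding and similar APIs give you. This is a sketch of the two programming styles, not any of those APIs:

```python
from collections import Counter, defaultdict
from itertools import chain

LINES = ["spark and hadoop", "spark or hadoop", "spark"]

# "Assembly-code" style: explicit map, shuffle and reduce phases.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

shuffled = defaultdict(list)
for word, one in chain.from_iterable(mapper(l) for l in LINES):
    shuffled[word].append(one)          # the shuffle: group values by key
mr_counts = dict(reducer(w, c) for w, c in shuffled.items())

# Higher-level style: the whole job collapses into one expression.
hl_counts = dict(Counter(chain.from_iterable(l.split() for l in LINES)))

print(mr_counts == hl_counts, mr_counts["spark"])  # True 3
```

Both compute the same result; the point is how much plumbing the low-level style forces you to write yourself.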
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
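The directed-acyclic-graph-of-tasks idea can be sketched with Python's stdlib graphlib: declare task dependencies, then execute in a dependency-respecting order. The task names are made up; real Tez DAGs are YARN applications, not local function calls:

```python
from graphlib import TopologicalSorter

# A toy task DAG: two extract tasks feed a join, which feeds a report.
dag = {
    "join":   {"extract_a", "extract_b"},
    "report": {"join"},
}

executed = []
def run(task):
    executed.append(task)   # stand-in for actually running the task

# static_order() yields tasks so that every dependency runs first.
for task in TopologicalSorter(dag).static_order():
    run(task)

print(executed)  # the two extracts (in either order), then join, then report
```

Expressing the whole pipeline as one DAG is what lets an engine like Tez (or Spark) avoid the intermediate materialization that chains of separate MapReduce jobs require.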
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
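The RDD idea can be sketched as a miniature, single-machine toy (not Spark's API): transformations are lazy and only recorded as lineage, and an action materializes the result.

```python
class ToyRDD:
    """A tiny sketch of Spark's RDD model: map/filter only record
    lineage; collect() (the action) replays it to produce data."""

    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []

    def map(self, f):
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, f):
        return ToyRDD(self._data, self._lineage + [("filter", f)])

    def collect(self):
        out = self._data
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the lineage additionally enables fault tolerance: a lost partition is recomputed from its recorded transformations instead of being replicated up front.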
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
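The "reuse your mapper and reducer" point can be sketched in plain Python (not the Hadoop or Spark APIs; the function names are hypothetical): an existing mapper becomes a flatMap-style step and the reducer a reduceByKey-style step.

```python
from collections import defaultdict

# An existing Hadoop-style mapper and reducer (hypothetical user code).
def my_mapper(record):
    for token in record.split(","):
        yield token.strip(), 1

def my_reducer(key, values):
    return key, sum(values)

# Spark-style reuse: the mapper plugs into flatMap, the reducer into reduceByKey.
def flat_map(func, data):
    return [pair for rec in data for pair in func(rec)]

def reduce_by_key(func, pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(func(k, vs) for k, vs in groups.items())

data = ["a, b, a", "b, c"]
print(reduce_by_key(my_reducer, flat_map(my_mapper, data)))  # {'a': 2, 'b': 2, 'c': 1}
```

The business logic (my_mapper, my_reducer) is untouched; only the surrounding execution plumbing changes, which is what makes this kind of migration cheap.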
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (Status: Open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
[Diagram: Hadoop-ecosystem services and their open source tools integrated with Spark, grouped by category: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
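The schema-inference idea can be sketched in plain Python with the stdlib json module (the records are made up, and Spark SQL's real inference additionally merges nested structures and resolves conflicting types):

```python
import json

# Hypothetical JSON-lines records; fields vary from record to record.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

def infer_schema(lines):
    """Union the fields seen across all records, noting each field's type."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Once such a schema exists, the "just point at JSON files and query" workflow follows: the inferred fields become queryable columns without any DDL.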
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
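The "Data << RAM" advantage can be sketched by counting how often an expensive parse step runs with and without an in-memory cache (plain Python, not Spark's cache(); the data and parse function are made up):

```python
PARSE_CALLS = 0

def parse(raw):
    """Pretend-expensive parse step; counts how often it runs."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return [int(x) for x in raw.split(",")]

RAW = "1,2,3,4"

# Without caching: every computation re-reads and re-parses the input,
# the way a chain of independent MapReduce jobs would.
total = sum(parse(RAW))
peak = max(parse(RAW))
assert PARSE_CALLS == 2

# With caching: parse once, reuse the in-memory result for every
# later computation (the Spark-style win when data fits in RAM).
PARSE_CALLS = 0
cached = parse(RAW)
total, peak = sum(cached), max(cached)
print(PARSE_CALLS, total, peak)  # 1 10 4
```

The saving compounds with every extra pass over the same dataset, which is exactly the iterative and interactive workload Spark targets.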
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
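File-system agnosticism can be sketched as a job that only assumes an iterable of lines, so the storage backend can be swapped freely (plain Python, not Spark's storage API; the sources are stand-ins):

```python
import io

def word_count(lines):
    """Storage-agnostic job: works on any iterable of lines, wherever they live."""
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# The same job runs unchanged over an in-memory source or a file-like
# object; in real Spark, over HDFS, S3, Cassandra, Tachyon, Swift, ...
in_memory = ["spark on s3", "spark on tachyon"]
file_like = io.StringIO("spark on s3\nspark on tachyon\n")

print(word_count(in_memory) == word_count(file_like))  # True
```

Because the computation never names its storage layer, "Bring Your Own Storage" is a configuration decision, not a code change.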
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
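In practice, switching between these deployment modes mostly comes down to the master URL handed to `spark-submit --master` or `SparkConf().setMaster(...)`. The helper below is a sketch of that convention; host names, ports, and mode labels follow Spark's documented URL schemes, and the defaults shown (7077, 5050) are Spark's usual ones.

```python
def master_url(mode, host=None, port=None, threads=None):
    """Build the master URL that selects a Spark deployment mode."""
    if mode == "local":
        # e.g. local[4] runs 4 worker threads in-process
        return "local[%d]" % threads if threads else "local"
    if mode == "standalone":
        return "spark://%s:%d" % (host, port or 7077)
    if mode == "mesos":
        return "mesos://%s:%d" % (host, port or 5050)
    if mode == "yarn":
        return "yarn-client"
    raise ValueError("unknown mode: %r" % mode)
```

For example, `master_url("standalone", host="master-host")` yields the URL you would pass to `spark-submit --master` for a standalone cluster.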
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform, with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
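To give a feel for how concise the native API is, here is a minimal sketch in PySpark style: a pure predicate (testable on its own) plugged into Spark's lazy transformation/action model. The log path is hypothetical, and `sc` is assumed to be a live SparkContext.

```python
def is_error(line):
    """Pure predicate -- easy to unit-test outside of Spark."""
    return "ERROR" in line

def error_count(sc, path="server.log"):
    # The same predicate plugged into the Spark API:
    lines = sc.textFile(path)              # RDD of strings (lazy)
    errors = lines.filter(is_error)        # lazily transformed
    return errors.count()                  # action triggers the job
```

In an interactive shell (Scala or Python), each of these lines can be typed and explored one at a time, which is a large part of the API's appeal.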
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
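The "mix and match" point can be sketched in a few lines of PySpark (Spark 1.2-era API): a JSON file becomes a table with inferred schema, then plain SQL runs over it. The file name and table name are hypothetical, and `sqlCtx` is assumed to be a live SQLContext.

```python
def adults_query(table, min_age=21):
    """Pure helper: the SQL text handed to Spark SQL."""
    return "SELECT name FROM %s WHERE age >= %d" % (table, min_age)

def adults(sqlCtx, path="people.json"):
    # jsonFile infers the schema automatically -- no DDL needed.
    people = sqlCtx.jsonFile(path)
    people.registerTempTable("people")
    return sqlCtx.sql(adults_query("people"))
```

The result of `adults(...)` is itself an RDD of rows, so it can flow straight into further imperative transformations.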
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: One record at a time | Mini-batches
• Latency: Sub-second | A few seconds
• Fault tolerance (every record processed): At least once (may have duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• "Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big-Data-related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on "Hadoop Isn't Just Hadoop Anymore", for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you with Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list, stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
• Processing model: On-disk (disk-based parallelization), Batch | On-disk, Batch, Interactive | In-memory and on-disk, Batch, Interactive, Streaming (Near-Real-Time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java, [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
• Compatibility: Compatibility to data types and data sources is the same for all three
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
• Performance (Spark): good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy, partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
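Point 1 above can be sketched in a few lines: the classic word-count mapper and reducer, written as ordinary functions, then re-wired into Spark's `flatMap`/`reduceByKey`. This is an illustrative sketch in Python; the path and function names are placeholders, and `sc` is assumed to be a live SparkContext.

```python
def mapper(line):
    """MapReduce-style map(): emit (word, 1) pairs for one input line."""
    return [(w, 1) for w in line.split()]

def reducer(a, b):
    """MapReduce-style reduce(): combine two counts for the same key."""
    return a + b

def word_count_on_spark(sc, path):
    # The same two functions, reused unchanged in a Spark pipeline:
    return sc.textFile(path).flatMap(mapper).reduceByKey(reducer)
```

The mapper and reducer stay testable on their own, which is exactly what makes this migration path low-risk.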
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer between any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, April 25, 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the stack (tool logos omitted in this text version):
• Storage / Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• A benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark into Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark opens many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
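In PySpark, the Hive integration is reached through a HiveContext. The sketch below assumes a Spark build with Hive support and a configured metastore; the table name `src` is the conventional Hive example table and is hypothetical here.

```python
def sample_query(table, limit=10):
    """Pure helper: the HiveQL text to run."""
    return "SELECT key, value FROM %s LIMIT %d" % (table, limit)

def query_hive_table(sc, table="src"):
    # HiveContext picks up the Hive metastore configuration; the result
    # is an ordinary Spark dataset that can feed MLlib or further SQL.
    from pyspark.sql import HiveContext
    hc = HiveContext(sc)
    return hc.sql(sample_query(table))
```

This is the sense in which Hive tables become just another Spark data source: once loaded, the rows flow through the same RDD APIs as any other data.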
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
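A minimal sketch of the native integration, assuming Spark's Kafka support is on the classpath and `ssc` is a live StreamingContext (the Python `KafkaUtils.createStream` API appeared around Spark 1.3); the ZooKeeper host, consumer group, and topic names are placeholders.

```python
def message_body(kv):
    """Kafka records arrive as (key, value) pairs; keep the value."""
    return kv[1]

def kafka_lines(ssc, zk_quorum="zk-host:2181", topic="events"):
    # Each mini-batch of the resulting DStream holds the message bodies
    # received from the Kafka topic during that batch interval.
    from pyspark.streaming.kafka import KafkaUtils
    stream = KafkaUtils.createStream(ssc, zk_quorum, "demo-group", {topic: 1})
    return stream.map(message_body)
```

From here the DStream supports the usual transformations (filter, window, reduceByKey, …) before the results are written out.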
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
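To make the "no more DDL" point concrete, the sketch below includes a tiny pure-Python model of what schema inference does (take the union of field names across records), next to the Spark SQL calls that do the real thing. `sc`/`sqlCtx` are assumed live; the table name and query are illustrative.

```python
import json

def infer_keys(json_lines):
    """Tiny model of schema inference: union of field names across records."""
    keys = set()
    for line in json_lines:
        keys |= set(json.loads(line))
    return sorted(keys)

def json_table(sqlCtx, sc, json_lines):
    # With a live SparkContext/SQLContext, the same records become a
    # queryable table with no DDL at all (Spark 1.2-era jsonRDD API):
    people = sqlCtx.jsonRDD(sc.parallelize(json_lines))
    people.registerTempTable("people")
    return sqlCtx.sql("SELECT name FROM people WHERE age > 30")
```

Records with different fields simply widen the inferred schema, which is what makes this convenient for semi-structured data.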
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or … HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
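The Data << RAM point is easiest to see in code. The sketch below is plain Python, with no Spark involved; `CachedDataset` is a made-up class that mimics what `rdd.cache()` buys an iterative job: the expensive parse happens once, and every later pass over the data is served from memory.

```python
import json

class CachedDataset:
    """Toy analogue of Spark's rdd.cache(): parse the raw input once,
    keep the parsed records in memory, and reuse them across passes."""
    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
        self._parsed = None      # filled on first use
        self.parse_calls = 0     # counts how often we pay the parse cost

    def records(self):
        if self._parsed is None:             # first pass: parse and cache
            self.parse_calls += 1
            self._parsed = [json.loads(line) for line in self.raw_lines]
        return self._parsed                  # later passes: served from RAM

lines = ['{"user": "a", "clicks": 3}', '{"user": "b", "clicks": 5}']
ds = CachedDataset(lines)
total = sum(r["clicks"] for r in ds.records())   # pass 1: parses the data
users = {r["user"] for r in ds.records()}        # pass 2: cache hit, no parse
```

When the data no longer fits in memory, this advantage fades, which is exactly the Data >> RAM caveat above.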
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: it dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
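Item 2 above describes a pipeline whose input and output are both message topics, with no file system in between. A minimal plain-Python sketch of that shape, with `deque`s standing in for Kafka topics (the topic names and the alert format are invented for illustration, not a Kafka API):

```python
import json
from collections import deque

# In-memory stand-ins for a Kafka input topic and output topic
input_topic = deque(['{"sensor": "t1", "temp": 21}',
                     '{"sensor": "t1", "temp": 25}'])
output_topic = deque()

while input_topic:                        # consume: read events off the topic
    event = json.loads(input_topic.popleft())
    if event["temp"] > 22:                # transform: keep only hot readings
        output_topic.append(json.dumps({"alert": event["sensor"]}))

# No HDFS or any other file system was touched:
# data flowed topic -> job -> topic.
```

A real deployment would swap the deques for Kafka consumers and producers; the point is that persistence is optional when the downstream system is another stream.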
75
1 File System
Being file-system agnostic, and coupled with its analytics capabilities, Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives
Components:
Hadoop ecosystem | Spark ecosystem
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose lambda expressions make code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
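The conciseness point is about chaining short lambda-based transformations. A plain-Python sketch of that style, with the builtins `filter` and `map` standing in for the RDD operations of the same names (no Spark required):

```python
# Lambda-style chaining, as the Spark APIs encourage.
# rdd.filter(...).map(...) in Spark has the same shape as:
nums = range(1, 11)

result = list(map(lambda x: x * x,                       # ~ rdd.map(...)
                  filter(lambda x: x % 2 == 0, nums)))   # ~ rdd.filter(...)
# result holds the squares of the even numbers in 1..10
```

In Spark itself the same two lambdas would be passed to `rdd.filter` and `rdd.map`; Java 8 lambdas give the Java API this same one-line-per-transformation feel.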
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
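The "mix and match SQL and imperative APIs" idea can be shown without Spark: the sketch below uses Python's stdlib `sqlite3` as a stand-in for Spark SQL (the `ads` table and its columns are invented for the example). Schema-carrying JSON is ingested, queried with SQL, and the result is post-processed with ordinary code.

```python
import json
import sqlite3

# Ingest schema-carrying data (JSON records, as Spark SQL can do natively)
raw = ['{"name": "ad-1", "clicks": 10}', '{"name": "ad-2", "clicks": 40}']
rows = [(r["name"], r["clicks"]) for r in map(json.loads, raw)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ads (name TEXT, clicks INT)")
db.executemany("INSERT INTO ads VALUES (?, ?)", rows)

# Declarative SQL step ...
top = db.execute("SELECT name, clicks FROM ads WHERE clicks > 20").fetchall()

# ... mixed with an imperative step on the result
labels = [f"{name}: {clicks} clicks" for name, clicks in top]
```

In Spark SQL the same flow would be `jsonRDD` / `read.json`, a `SELECT`, and then ordinary RDD or DataFrame code over the result.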
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
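The first two table rows come down to one structural difference, sketched here in plain Python (no Storm or Spark involved; the batch size of 4 is arbitrary): record-at-a-time systems handle each event the moment it arrives, while Spark Streaming first groups events into fixed mini-batches and runs a small batch job per group, which is what adds the few seconds of latency.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Storm-style: each record is handled the moment it arrives
record_at_a_time = [e * 10 for e in events]

# Spark Streaming-style: records are grouped into fixed mini-batches
# first, then each batch is processed as one small batch job
def mini_batches(stream, size):
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

batched = [sum(batch) for batch in mini_batches(events, 4)]  # one result per batch
```

The mini-batch model is also what lets Spark Streaming reuse the core Spark API and its exactly-once batch semantics.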
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset (http://bigdata.andreamostosi.name): an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list, stay tuned!
20
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation, MapReduce: Batch
• 2nd generation, Tez: Batch, Interactive
• 3rd generation, Spark: Batch, Interactive, Near-Real-Time
• 4th generation, Flink: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
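What makes an RDD "resilient" is that it remembers its lineage (the chain of transformations), not just the data, so a lost partition can be recomputed rather than restored from disk. A toy sketch of that idea in plain Python (`ToyRDD` is a made-up class, far simpler than Spark's real RDD):

```python
class ToyRDD:
    """Toy of the RDD idea: store the base data plus the lineage of
    transformations, so results can always be recomputed on demand."""
    def __init__(self, source, transforms=()):
        self.source = source            # base data
        self.transforms = transforms    # lineage: functions to re-apply

    def map(self, f):
        # Transformations are lazy: they only extend the lineage
        return ToyRDD(self.source, self.transforms + (f,))

    def compute(self):
        # An action: rebuild the result from the lineage
        data = list(self.source)
        for f in self.transforms:
            data = [f(x) for x in data]
        return data

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
result = rdd.compute()   # can be recomputed any time from the lineage
```

Real RDDs add partitioning, caching, and fault recovery per partition, but the lazy-lineage shape is the same.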
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
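The co-occurrence recommendation technique referenced above has a small core, sketched here in plain Python (no Mahout or Spark; the shopping histories are invented sample data): count how often item pairs appear together in user histories, then recommend items with high co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Invented sample data: one purchase history per user
histories = [["tv", "hdmi-cable"],
             ["tv", "hdmi-cable", "soundbar"],
             ["tv", "soundbar"]]

cooc = Counter()
for items in histories:
    # count each unordered pair of distinct items seen together
    for a, b in combinations(sorted(set(items)), 2):
        cooc[(a, b)] += 1

# Items seen together with "tv", strongest signal first
with_tv = sorted(((n, pair) for pair, n in cooc.items() if "tv" in pair),
                 reverse=True)
```

Mahout's Spark version distributes exactly this pair-counting over RDDs and then filters the raw counts with a significance test (LLR) before recommending.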
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Spark integrates with open source tools across the Hadoop ecosystem (tool logos omitted from this slide), by service category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
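A short sketch of the Hive integration described above, using Spark's HiveContext (`sc` is assumed to be an existing SparkContext, and the table names are hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; table names are hypothetical
val hiveCtx = new HiveContext(sc)

// Query an existing Hive table with plain HiveQL
val top = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
top.collect().foreach(println)

// Write query results back out to a Hive table
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")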
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka; Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
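The native Kafka integration can be sketched as follows (the ZooKeeper quorum, consumer group, and topic name are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical ZooKeeper quorum, group id, and topic name
val ssc = new StreamingContext(sc, Seconds(10))

// Receiver-based stream of (key, message) pairs from the "events" topic
val messages = KafkaUtils
  .createStream(ssc, "zkhost:2181", "demo-group", Map("events" -> 1))
  .map(_._2)

messages.count().print()   // number of events per 10-second micro-batch

ssc.start()
ssc.awaitTermination()
```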
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume; there are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
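The schema-inference workflow described above looks roughly like this (the input file name is hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records -- no DDL needed
val people = sqlContext.jsonFile("people.json")   // hypothetical input file
people.printSchema()

// Register and query directly with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect()
  .foreach(println)
```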
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
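The three bullets above can be sketched in a few lines (paths are hypothetical; `sqlContext` is assumed to be an existing SQLContext):

```scala
// Import relational data from a Parquet file (hypothetical path)
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")

// Run SQL queries over the imported data
events.registerTempTable("events")
val daily = sqlContext.sql("SELECT COUNT(*) FROM events")

// Write any SchemaRDD back out as Parquet
daily.saveAsParquetFile("hdfs:///data/daily_counts.parquet")
```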
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL; this library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
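The elasticsearch-hadoop RDD integration mentioned above can be sketched as follows (the node address and index/type names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds esRDD / saveToEs extensions

// Hypothetical Elasticsearch node and index/type names
val conf = new SparkConf()
  .setAppName("EsSketch")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Read an Elasticsearch index as an RDD of (id, document) pairs
val docs = sc.esRDD("logs/events")
println(docs.count())

// Save any RDD whose elements translate into documents
sc.makeRDD(Seq(Map("level" -> "WARN", "msg" -> "disk almost full")))
  .saveToEs("logs/alerts")
```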
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
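This cluster-manager agnosticism shows up directly in application code: only the master URL changes between deployments. A minimal sketch (host names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The application code itself does not change across deployments --
// only the master URL selects the cluster manager (hosts are hypothetical)
val conf = new SparkConf()
  .setAppName("DeploymentAgnosticApp")
  .setMaster("local[4]")              // local mode, 4 threads
// .setMaster("spark://host:7077")    // standalone cluster
// .setMaster("mesos://host:5050")    // Apache Mesos
// (on YARN, the master is usually supplied via spark-submit --master yarn-cluster)
val sc = new SparkContext(conf)
```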
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component → Spark ecosystem alternative
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria          | YARN                                     | Mesos
Resource sharing  | Yes                                      | Yes
Written in        | Java                                     | C++
Scheduling        | Memory only                              | CPU and Memory
Running tasks     | Unix processes                           | Linux Container groups
Requests          | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity          | Less mature                              | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
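The conciseness of the native API is easiest to see on the canonical word count (paths are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
// Word count in the native Scala API -- the whole job is a few lines
val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/word_counts")
```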
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
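The "mix and match" point above can be sketched in Spark 1.2-era code (the input file and case class are hypothetical; `sc`/`sqlContext` are assumed to exist):

```scala
// Hypothetical input: CSV lines of "name,age"
case class Person(name: String, age: Int)

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion
people.registerTempTable("people")

// Declarative SQL ...
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// ... mixed with the imperative RDD API
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
```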
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
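As a flavor of the MLlib API (the input file of space-separated numeric features is hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input file with space-separated numeric features
val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the data into 2 groups, with up to 20 iterations
val model = KMeans.train(data, 2, 20)
model.clusterCenters.foreach(println)
```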
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
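A minimal Spark Streaming sketch, the classic network word count (the host/port of the text source are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical host/port of a text source (e.g. `nc -lk 9999`)
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Word count over 1-second micro-batches
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```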
95
Storm vs Spark Streaming
Criteria                                  | Storm                             | Spark Streaming
Processing model                          | Record at a time                  | Mini batches
Latency                                   | Sub-second                        | Few seconds
Fault tolerance (every record processed)  | At least once (may be duplicates) | Exactly once
Batch framework integration               | Not available                     | Core Spark API
Supported languages                       | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand" – Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): batch
• Tez (2nd generation): batch, interactive
• Spark (3rd generation): batch, interactive, near-real-time
• Flink (4th generation): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs); …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing": https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | Hadoop MapReduce                          | Tez                              | Spark
License             | Open Source, Apache 2.0, version 2.x      | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive    | In-memory, on-disk; batch, interactive, streaming (near-real-time)
Language written in | Java                                      | Java                             | Scala
API                 | [Java, Python, Scala], user-facing        | Java, [ISV/engine/tool builder]  | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                      | None                             | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark (continued)
Criteria         | Hadoop MapReduce                          | Tez                              | Spark
Installation     | Bound to Hadoop                           | Bound to Hadoop                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration | YARN application                          | Ground-up YARN application       | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark (continued)
Criteria    | Hadoop MapReduce        | Tez                     | Spark
Deployment  | YARN                    | YARN                    | [Standalone, YARN, SIMR, Mesos, …]
Performance | –                       | –                       | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
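Point 1 above can be sketched as follows: existing per-record logic from a MapReduce job becomes plain functions that a Spark pipeline calls (the record format and input path here are hypothetical):

```scala
// Hypothetical per-record logic lifted out of an old MapReduce job:
// the former Mapper body becomes a plain function ...
def parseRecord(line: String): (String, Int) = {
  val fields = line.split(",")
  (fields(0), fields(1).toInt)
}
// ... and the former Reducer body becomes an associative combine function
def combine(a: Int, b: Int): Int = a + b

// The Spark pipeline then just calls them (input path is hypothetical)
val totals = sc.textFile("hdfs:///data/input")
  .map(parseRecord)
  .reduceByKey(combine)
```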
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
[Diagram: Hadoop ecosystem services and their open source tools integrated with Spark — storage/serving layer, data formats, data ingestion services, resource management, search, SQL]
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without needing the Hadoop API at all, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as 'the' resource negotiator)
• Integration is still improving; see the open 'YARN' issues in the Spark JIRA (project = SPARK AND summary ~ yarn AND status = OPEN)
• Some of these issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
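Schema inference here means scanning the data and taking the union of fields seen across records, instead of declaring a schema up front with DDL. A toy sketch of the idea in plain Python (not the Spark implementation):

```python
import json

def infer_schema(records):
    """Merge key -> type across JSON objects, loosely mimicking how an
    engine can derive a schema by scanning the data instead of using DDL."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, type(value).__name__)
    return schema

lines = ['{"name": "Alice", "age": 34}',
         '{"name": "Bob", "city": "LA"}']
schema = infer_schema(json.loads(line) for line in lines)
# Union of all fields seen across both records: name, age, city
```

A real engine also reconciles conflicting types across records and handles nesting; this sketch just keeps the first type it sees per field.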
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: 'CrunchIndexerTool on Spark'
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more 'stream oriented', has a more mature shuffling implementation, and has closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'smart execution engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the Big Data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the 'Right' Execution Engine, Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, Erik Halseth (Datameer), January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recordit.blog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
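Whatever the cluster manager, the application code stays the same; only the master URL passed to spark-submit changes. A hedged illustration (host names, ports and app.jar are placeholders; in Spark 1.x the YARN modes are spelled yarn-client and yarn-cluster):

```
spark-submit --master local[4]            app.jar   # local mode, 4 worker threads
spark-submit --master spark://host:7077   app.jar   # standalone cluster
spark-submit --master mesos://host:5050   app.jar   # Apache Mesos
spark-submit --master yarn-client         app.jar   # Hadoop YARN
```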
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, Eric Carr, September 25, 2014: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop ecosystem | Spark ecosystem
-----------------|------------------|------------------------
Storage          | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center 'OS':
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
-----------------|-------------------------------------------|------------------------
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
-----------------------------------------|-----------------------------------|---------------------
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
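The 'mini batches' model in the table above trades latency for throughput and exactly-once semantics by grouping a record stream into small batches that are then processed as ordinary batch jobs. A toy sketch of the idea in plain Python (not the Spark implementation):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size batches, the way a
    micro-batch engine turns streaming into repeated small batch jobs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # hand a full batch to the batch engine
            batch = []
    if batch:                # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), 3))
# [[0, 1, 2], [3, 4, 5], [6]]
```

In a real engine the trigger is a time interval (the batch duration) rather than a record count, which is where the 'few seconds' latency in the table comes from.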
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you with Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (machine learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned
20
5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
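The 'assembly code' complaint is about boilerplate: the classic Java WordCount needs mapper and reducer classes plus job wiring. The same three phases can be sketched in a few lines of plain Python (illustration only, no Hadoop involved):

```python
from collections import defaultdict

lines = ["to be or not to be", "to spark"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the values for each key
counts = {key: sum(values) for key, values in groups.items()}
# counts["to"] == 3, counts["be"] == 2
```

The higher-level APIs on this slide (Pig, Hive, Scalding, Crunch, …) exist precisely to let you write at this level of brevity while still compiling down to MapReduce jobs.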
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: 'A YARN-based system for parallel processing of large data sets': http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• Rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the core capability of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License: Apache 2.0 open source for all three (MapReduce version 2.x, Tez version 0.x, Spark version 1.x)
• Processing model: on-disk (disk-based parallelization), batch (MapReduce); on-disk, batch and interactive (Tez); in-memory and on-disk, batch, interactive and streaming/near-real-time (Spark)
• Language written in: Java (MapReduce); Java (Tez); Scala (Spark)
• API: [Java, Python, Scala], user-facing (MapReduce); Java, for ISV/engine/tool builders (Tez); [Scala, Java, Python], user-facing (Spark)
• Libraries: none, separate tools (MapReduce); none (Tez); [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX] (Spark)
29
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Installation: bound to Hadoop (MapReduce); bound to Hadoop (Tez); not bound to Hadoop (Spark)
• Ease of use: difficult to program, needs abstractions, no interactive mode except via Hive/Pig (MapReduce); difficult to program, no interactive mode except via Hive/Pig (Tez); easy to program, no need for abstractions, interactive mode (Spark)
• Compatibility: compatibility with data types and data sources is the same for all three
• YARN integration: YARN application (MapReduce); ground-up YARN application (Tez); Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Deployment: YARN (MapReduce); YARN (Tez); [standalone, YARN, SIMR, Mesos, …] (Spark)
• Performance: Spark offers good performance when data fits into memory, with performance degradation otherwise
• Security: more features and projects (MapReduce); more features and projects (Tez); still in its infancy, partial support (Spark)
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
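As a sketch of point 1, per-record logic written for an existing MapReduce job can often be called directly from Spark transformations. All names below (the legacy functions, the log format, the paths) are hypothetical, just to show the shape of such a migration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical logic lifted out of an existing MapReduce job:
// the mapper emitted (statusCode, 1) pairs from log lines,
// the reducer summed the counts per status code.
object Legacy {
  def mapLogic(line: String): Seq[(String, Int)] =
    line.split("\\s+").lastOption.map(status => (status, 1)).toSeq
  def reduceLogic(a: Int, b: Int): Int = a + b
}

object MigratedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MigratedJob"))
    sc.textFile("hdfs:///logs/access.log")   // placeholder path
      .flatMap(Legacy.mapLogic)              // reuse the mapper's per-record logic
      .reduceByKey(Legacy.reduceLogic)       // reuse the reducer's combine logic
      .saveAsTextFile("hdfs:///logs/status-counts")
    sc.stop()
  }
}
```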
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading on Spark (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout on Spark (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout on Spark (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services of the Hadoop ecosystem and the open source tools integrating with Spark, by category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
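A minimal sketch of the newAPIHadoopRDD approach described above, along the lines of HBaseTest.scala; the table name is a placeholder and the HBase configuration is assumed to be on the classpath.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseSketch"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

    // Each record is a (row key, Result) pair produced by HBase's InputFormat
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Rows in table: ${hbaseRDD.count()}")
    sc.stop()
  }
}
```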
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
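A sketch of the Spark Cassandra Connector API described above; the contact point, keyspace, table and column names are all hypothetical.

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and filter it with ordinary Spark code
    val users = sc.cassandraTable("my_keyspace", "users")  // hypothetical table
    val active = users.filter(row => row.getBoolean("active"))
    println(s"Active users: ${active.count()}")

    // Write a plain RDD back to another (hypothetical) table
    sc.parallelize(Seq(("logins", 1)))
      .saveToCassandra("my_keyspace", "counters", SomeColumns("name", "value"))
    sc.stop()
  }
}
```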
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: an introduction to Spark with Cassandra (part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
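The Hive support above can be sketched with the Spark 1.2-era HiveContext API; the table and column names are hypothetical, and a Hive metastore is assumed to be configured.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveContext = new HiveContext(sc)

    // Run HiveQL against tables registered in the Hive metastore
    val recent = hiveContext.sql(
      "SELECT user_id, amount FROM sales WHERE year = 2015") // hypothetical table
    recent.collect().foreach(println)
    sc.stop()
  }
}
```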
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
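A minimal sketch of the receiver-based Kafka integration from the guide above, in the Spark 1.x API; the ZooKeeper address, consumer group and topic name are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> #threads)
    val messages = KafkaUtils.createStream(ssc,
      "zookeeper:2181", "my-group", Map("my-topic" -> 1))

    // Count words per batch, just to show a transformation on the stream
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1L))
      .reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```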
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
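The schema-inference workflow above can be sketched in the Spark 1.2-era API; the file path and field names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonSketch"))
    val sqlContext = new SQLContext(sc)

    // The schema is inferred from the JSON records; no DDL needed
    val people = sqlContext.jsonFile("hdfs:///data/people.json") // placeholder path
    people.printSchema()

    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)
    sc.stop()
  }
}
```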
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
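The Parquet round trip listed above can be sketched as follows, again in the 1.2-era API; the paths and the column name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetSketch"))
    val sqlContext = new SQLContext(sc)

    // Read a Parquet file as a SchemaRDD, query it, and write the result back out
    val events = sqlContext.parquetFile("hdfs:///data/events.parquet") // placeholder
    events.registerTempTable("events")
    sqlContext.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
      .saveAsParquetFile("hdfs:///data/event_counts.parquet")          // placeholder
    sc.stop()
  }
}
```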
58
3. Integration
• The Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
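A sketch of the elasticsearch-hadoop RDD integration mentioned above; the cluster address and the index/type name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs and esRDD to SparkContext

object EsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsSketch")
      .set("es.nodes", "localhost:9200") // placeholder cluster address
    val sc = new SparkContext(conf)

    // Any RDD whose elements translate to documents can be indexed
    val docs = sc.makeRDD(Seq(
      Map("title" -> "Spark with Hadoop", "views" -> 10),
      Map("title" -> "Spark without Hadoop", "views" -> 7)))
    docs.saveToEs("talks/slides") // hypothetical index/type

    // Reading back yields an RDD of (document id, document) pairs
    println(sc.esRDD("talks/slides").count())
    sc.stop()
  }
}
```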
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate the ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
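The S3 option in point 4 can be sketched in a few lines: reading from S3 looks exactly like reading from HDFS, only the URI scheme changes. The bucket name and credentials below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3Sketch"))
    // Credentials would normally come from the environment or core-site.xml
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")     // placeholder
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY") // placeholder

    // Same textFile API as HDFS; only the scheme differs
    val logs = sc.textFile("s3n://my-bucket/logs/*.log") // hypothetical bucket
    println(s"Error lines: ${logs.filter(_.contains("ERROR")).count()}")
    sc.stop()
  }
}
```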
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
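In code, the cluster-manager choice surfaces only as the master URL; the same application can target several of the deployments above. The host names below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploySketch {
  def main(args: Array[String]): Unit = {
    // Pick one master URL; the application code itself does not change.
    val conf = new SparkConf().setAppName("DeploySketch")
      .setMaster("local[4]")                    // local mode, 4 threads
      // .setMaster("spark://master-host:7077") // standalone cluster (placeholder host)
      // .setMaster("mesos://mesos-host:5050")  // Mesos (placeholder host)
      // .setMaster("yarn-client")              // YARN (Spark 1.x syntax)
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())     // tiny sanity job
    sc.stop()
  }
}
```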
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: ultra-fast data analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component and its Spark ecosystem alternative:
• HDFS: Tachyon
• YARN: Mesos
• Pig: Spark native API
• Hive: Spark SQL
• Mahout: MLlib
• Storm: Spark Streaming
• Giraph: GraphX
• HUE: Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Resource sharing: yes (YARN); yes (Mesos)
• Written in: Java (YARN); C++ (Mesos)
• Scheduling: memory only (YARN); CPU and memory (Mesos)
• Running tasks: Unix processes (YARN); Linux container groups (Mesos)
• Requests: specific requests and locality preference (YARN); more generic, but more coding for writing frameworks (Mesos)
• Maturity: less mature (YARN); relatively more mature (Mesos)
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
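The conciseness of this chained, functional style can be sketched in plain stdlib Python. This is an illustrative stand-in, not PySpark itself; no Spark installation is assumed, and the sample lines are invented:

```python
from collections import Counter
from functools import reduce

# A word count in the chained, functional style that Spark's native
# API (Scala, Python, Java 8 lambdas) exposes. Pure-Python sketch.
lines = ["to be or not to be", "to spark or not to spark"]

# "flatMap" step: split every line into words
words = [w for line in lines for w in line.split()]

# "map" + "reduceByKey" steps: fold each word into a running count
counts = reduce(
    lambda acc, w: (acc.update([w]) or acc),  # Counter.update returns None, so `or acc` keeps the accumulator
    words,
    Counter(),
)

print(counts["to"])     # 4
print(counts["spark"])  # 2
```

In actual Spark code the same shape would be `sc.textFile(...).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.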
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
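The "mix and match SQL and imperative APIs" idea can be illustrated with Python's stdlib `sqlite3` standing in for the SQL engine; the table and column names below are invented for the illustration, not Spark SQL API:

```python
import sqlite3

# Declarative + imperative mix, the pattern Spark SQL enables.
# sqlite3 is only a stand-in engine here; schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# SQL part: aggregate declaratively
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: arbitrary post-processing in the host language
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 8), ('bob', 7)]
print(top)   # ('ann', 8)
```

In Spark SQL the result of a query is an RDD (later a DataFrame), so the post-processing step can itself be distributed.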
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria                  Storm                       Spark Streaming
Processing model          Record at a time            Mini batches
Latency                   Sub-second                  Few seconds
Fault tolerance (every    At least once (may be       Exactly once
record processed)         duplicates)
Batch framework           Not available               Core Spark API
integration
Supported languages       Any programming language    Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
httpwwwslidesharenetsbaltagi
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset httpbigdataandreamostosiname An incomplete but useful list of Big Data related projects, packed into a JSON dataset
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on "Hadoop Isn't Just Hadoop Anymore" for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• MapReduce (1st generation): batch
• Tez (2nd generation): batch, interactive
• Spark (3rd generation): batch, interactive, near-real-time
• Flink (4th generation): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch! Scalability, abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing" httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                        Spark
License           Open Source Apache 2.0,   Open Source Apache 2.0,    Open Source Apache 2.0,
                  version 2.x               version 0.x                version 1.x
Processing model  On-disk (disk-based       On-disk; batch,            In-memory and on-disk; batch,
                  parallelization); batch   interactive                interactive, streaming
                                                                       (near real-time)
Written in        Java                      Java                       Scala
API               [Java, Python, Scala],    Java, [ISV/Engine/Tool     [Scala, Java, Python],
                  user-facing               builder]                   user-facing
Libraries         None, separate tools      None                       [Spark Core, Spark Streaming,
                                                                       Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                        Spark
Installation      Bound to Hadoop           Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;      Easy to program, no need of
                  needs abstractions;       no interactive mode        abstractions; interactive
                  no interactive mode       except Hive, Pig           mode
                  except Hive, Pig
Compatibility     Same for all three with respect to data types and data sources
YARN integration  YARN application          Ground-up YARN             Spark is moving towards YARN
                                            application
30
Hadoop MapReduce vs Tez vs Spark

Criteria      MapReduce                Tez                      Spark
Deployment    YARN                     YARN                     [Standalone, YARN, SIMR, Mesos, ...]
Performance   -                        -                        Good performance when data fits into
                                                                memory; performance degradation
                                                                otherwise
Security      More features and        More features and        Still in its infancy; partial support
              projects                 projects
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark" httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
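Point 1 above - reusing the same mapper and reducer functions - can be sketched in plain Python: one function pair drives both a MapReduce-style pipeline (explicit shuffle) and a Spark-style chained pipeline. This mimics the shape of the migration; in real PySpark the same functions would be passed to `rdd.flatMap` and `rdd.reduceByKey`:

```python
from collections import defaultdict
from itertools import groupby

# The same mapper/reducer pair, reused by two pipeline styles.
def mapper(line):
    # emit (word, 1) pairs, as a classic WordCount mapper does
    return [(w, 1) for w in line.split()]

def reducer(key, values):
    return (key, sum(values))

lines = ["spark and hadoop", "spark or hadoop"]

# MapReduce-style: explicit map -> shuffle (sort + group) -> reduce
pairs = sorted(kv for line in lines for kv in mapper(line))
mr_result = dict(reducer(k, [v for _, v in grp])
                 for k, grp in groupby(pairs, key=lambda kv: kv[0]))

# Spark-style: the very same mapper, folded in a chained fashion
# (in PySpark: sc.parallelize(lines).flatMap(mapper)
#              .reduceByKey(lambda a, b: a + b))
acc = defaultdict(int)
for line in lines:
    for k, v in mapper(line):
        acc[k] += v
spark_result = dict(acc)

print(mr_result)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```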
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Help existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (status: Open; Q1 2015) httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark; programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Services and their open source tools integrating with Spark (tool logos shown in the original slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector
• MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL! Just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
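What "automatically infer the schema" means can be sketched in a few lines of stdlib Python: scan the JSON records and union their fields and value types, which is roughly what Spark SQL does at scale (plus type widening and nested structures). The record contents below are invented for the illustration:

```python
import json

# Minimal sketch of JSON schema inference: union the fields seen
# across records and note each field's value type. Spark SQL performs
# this kind of pass when loading a JSON dataset, so no DDL is needed.
records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for rec in records:
    for field, value in json.loads(rec).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print(sorted(schema))  # ['age', 'city', 'name']
print(schema["age"])   # {'int'}
```

Note how the inferred schema is the union of fields: records need not all carry the same keys, which is exactly why schema-on-read beats up-front DDL for semi-structured data.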
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
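The Data << RAM point - parse once, cache in memory, reuse many times - is the essence of Spark's `rdd.cache()`. A stdlib Python sketch of the effect (the `parse` function and its input are invented for the illustration):

```python
from functools import lru_cache

parse_calls = 0

@lru_cache(maxsize=None)  # stands in for rdd.cache(): keep results in memory
def parse(line):
    # the expensive parsing work runs only once per distinct input
    global parse_calls
    parse_calls += 1
    return len(line.split(","))

data = "a,b,c"
# an iterative job (e.g. machine learning) touches the same data many times
results = [parse(data) for _ in range(5)]

print(results)      # [3, 3, 3, 3, 3]
print(parse_calls)  # 1 -- parsed once, served from memory afterwards
```

Without the cache, every iteration would pay the parse cost again, which is precisely the MapReduce behavior that makes iterative workloads slow on disk-based engines.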
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives: "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• ...
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
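As a hedged illustration of the first three modes above, these are sketch spark-submit invocations (host names, ports, and the application jar are placeholders):

```shell
spark-submit --master local[4]          --class MyApp app.jar   # 1. Local, 4 worker threads
spark-submit --master spark://host:7077 --class MyApp app.jar   # 2. Standalone cluster
spark-submit --master mesos://host:5050 --class MyApp app.jar   # 3. Apache Mesos
```

The application code is identical in all cases; only the `--master` URL changes.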
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• "Guavus (http://www.guavus.com/) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

Component | Hadoop ecosystem | Spark ecosystem
Storage | HDFS | Tachyon
Resource management | YARN | Mesos
Tools | Pig | Spark native API
Tools | Hive | Spark SQL
Tools | Mahout | MLlib
Tools | Storm | Spark Streaming
Tools | Giraph | GraphX
Tools | HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
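A hedged illustration of that conciseness: the classic word count in the native Scala API, where `sc` is an existing SparkContext and the input path is a placeholder:

```scala
// Word count in a few lines; contrast with the dozens of lines
// a Java MapReduce implementation typically needs.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // emit (word, 1) pairs
  .reduceByKey(_ + _)                    // sum counts per word
counts.take(10).foreach(println)
```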
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
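A minimal sketch of that Hive compatibility, using the Spark 1.2-era API; the `users` table and its columns are hypothetical, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)   // reuses the Hive metastore, formats, and UDFs
val adults = hiveCtx.sql("SELECT name, age FROM users WHERE age > 21")
// the result is an RDD of Rows, so SQL and RDD operations mix freely
adults.map(row => row.getString(0)).take(5).foreach(println)
```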
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org/
• Hive: http://hive.apache.org/
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org/
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org/
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

1st generation (MapReduce): batch
2nd generation (Tez): batch, interactive
3rd generation (Spark): batch, interactive, near-real-time
4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org/
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User-Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org/
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org/
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org/) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
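A sketch of point 1, assuming nothing beyond Spark itself: the per-record logic of a typical MapReduce word count maps directly onto flatMap/reduceByKey, so the function bodies can be reused (the input path is a placeholder and `sc` is an existing SparkContext):

```scala
// Former Mapper.map body, now a plain function
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// Former Reducer.reduce body, now a plain associative function
def reducer(a: Int, b: Int): Int = a + b

val result = sc.textFile("hdfs:///tmp/input").flatMap(mapper).reduceByKey(reducer)
```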
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
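The "-x spark" option mentioned above leaves existing scripts unchanged; only the launch command differs (the script name is a placeholder):

```shell
pig -x mapreduce wordcount.pig   # today: runs on MapReduce
pig -x spark     wordcount.pig   # same script, with Spark as the execution engine
```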
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
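The switch is a single session-level setting, as noted above; this is a sketch of a Hive CLI session, where the `employees` table is a placeholder:

```
hive> set hive.execution.engine=spark;
hive> SELECT dept, COUNT(*) FROM employees GROUP BY dept;  -- now executes as a Spark job
```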
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3 Integration
Services and open source tools that integrate with Spark, by category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• "Getting Started with Apache Spark and Cassandra": http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
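A minimal sketch against the connector's documented entry points; the keyspace, table, and column names are placeholders, and it assumes the spark-cassandra-connector jar on the classpath plus a reachable Cassandra cluster:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

// Read a Cassandra table as an RDD and project one column
val users = sc.cassandraTable("my_keyspace", "users")
users.map(row => row.getString("name")).take(10).foreach(println)

// Write an RDD of tuples back to the same table
sc.parallelize(Seq(("alice", 30)))
  .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
```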
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra in Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
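A sketch of the receiver-based integration from the Spark 1.x guide; the ZooKeeper address, consumer group, and topic are placeholders, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// (topic -> number of receiver threads)
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
stream.map(_._2).count().print()   // message count per 10-second batch
ssc.start()
ssc.awaitTermination()
```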
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
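A minimal sketch of that schema inference, using Spark 1.2-era names; `people.json` is a placeholder file of one JSON object per line, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")   // schema inferred automatically, no DDL
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()
```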
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
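The three bullets above can be sketched in a few lines, using the Spark 1.2-era names (`parquetFile` / `saveAsParquetFile`); the paths are placeholders and an existing SQLContext `sqlContext` is assumed:

```scala
// Import relational data from a Parquet file as a SchemaRDD
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// Run SQL over the imported data
val counts = sqlContext.sql("SELECT COUNT(*) FROM events")

// Write the result back out as Parquet
counts.saveAsParquetFile("hdfs:///data/event_counts.parquet")
```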
58
3 Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos references
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity + 
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
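The Data << RAM point above can be made concrete with a toy sketch (plain Python, not Spark): once parsed data is cached in memory, later passes over it cost no re-parsing. The parse function and line data here are hypothetical stand-ins.

```python
# Toy illustration of why in-memory caching pays off when data fits in RAM.
parse_calls = 0

def parse(line):
    """Pretend this is an expensive parse of a raw input line."""
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3", "4"]

# Without caching: every pass over the data re-parses it (disk-oriented model).
total_no_cache = sum(parse(l) for l in raw) + max(parse(l) for l in raw)  # 8 parses

# With caching: parse once, keep the parsed values in memory, reuse them.
cached = [parse(l) for l in raw]          # one parsing pass (4 parses)
total_cached = sum(cached) + max(cached)  # later passes hit memory only

print(parse_calls)  # 12 parses total: 8 uncached + 4 for the cache
```

Both totals are identical (14); only the amount of repeated work differs, which is the trade-off the slide describes.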
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark:
• Amazon S3
• httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS
• httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
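The options above all come down to the same idea: the storage backend is pluggable, and the processing code does not care which one sits behind a path. Spark resolves this through the Hadoop FileSystem API; the toy dispatcher below (hypothetical backends and scheme names) only illustrates the "bring your own storage" pattern.

```python
# Toy "bring your own storage" dispatcher: pick a backend from the URI scheme.
def make_reader(backends):
    def read(uri):
        scheme, _, path = uri.partition("://")
        if scheme not in backends:
            raise ValueError("no backend for scheme: " + scheme)
        return backends[scheme](path)
    return read

# Hypothetical stand-ins for an in-memory FS, an object store, and HDFS.
backends = {
    "memory": lambda path: "bytes-from-in-memory-store:" + path,
    "s3":     lambda path: "bytes-from-object-store:" + path,
    "hdfs":   lambda path: "bytes-from-hdfs:" + path,
}

read = make_reader(backends)
print(read("s3://bucket/data.csv"))  # bytes-from-object-store:bucket/data.csv
```

The application logic (the caller of `read`) is unchanged whichever backend serves the path, which is the property the slide relies on.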
75
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
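Across the deployment options above, the application code stays the same; in practice what changes is essentially the master URL handed to spark-submit. The sketch below only builds (never runs) such command lines, using Spark's documented master-URL formats; the host names and app file are hypothetical.

```python
# Illustrative helper: the deployment choice is mostly a --master URL.
def spark_submit(master, app="my_app.py"):
    """Build (not execute) a spark-submit command line for a cluster manager."""
    return ["spark-submit", "--master", master, app]

local      = spark_submit("local[4]")           # local mode, 4 worker threads
standalone = spark_submit("spark://host:7077")  # Spark standalone cluster
mesos      = spark_submit("mesos://host:5050")  # Apache Mesos
yarn       = spark_submit("yarn-cluster")       # Hadoop YARN (Spark 1.x syntax)

print(standalone)  # ['spark-submit', '--master', 'spark://host:7077', 'my_app.py']
```

The same `my_app.py` is submitted in every case, which is what makes Spark "agnostic to the underlying infrastructure for clustering".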
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, allowing much more concise lambda expressions that get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
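The conciseness claim above is easiest to see on word count, the canonical example. This is a plain-Python sketch of the same flatMap / map / reduceByKey shape of Spark's API, runnable without a Spark installation; the input line is made up.

```python
# Word count in the flatMap -> map -> reduceByKey shape of Spark's API,
# expressed with Python stdlib pieces instead of RDDs.
from collections import Counter
from itertools import chain

lines = ["to be or not to be"]

words  = chain.from_iterable(l.split() for l in lines)  # flatMap(_.split(" "))
pairs  = ((w, 1) for w in words)                        # map(w => (w, 1))
counts = Counter()                                      # reduceByKey(_ + _)
for w, n in pairs:
    counts[w] += n

print(counts["to"], counts["be"])  # 2 2
```

The whole pipeline fits in a few lines, which is the contrast the slide draws with the "assembly code" verbosity of hand-written Java MapReduce.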
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
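The "mix and match SQL and imperative APIs" point can be illustrated without a Spark install: below, the stdlib sqlite3 module stands in for the SQL engine, a declarative query does the filtering, and ordinary code post-processes the rows. The table and rows are made up for the sketch.

```python
# Mixing declarative SQL with imperative post-processing over the same data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INT)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("ann", 34), ("bob", 19), ("cep", 52)])

# Declarative step: SQL does the filtering...
adults = con.execute("SELECT name, age FROM people WHERE age >= 21").fetchall()

# ...imperative step: plain code transforms the result set.
names = sorted(name.upper() for name, _ in adults)
print(names)  # ['ANN', 'CEP']
```

In Spark SQL the same interplay happens over distributed data: a SQL query produces an RDD (a DataFrame from 1.3 on) that the programmatic API can keep transforming.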
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
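The table's central distinction, record-at-a-time versus mini batches, can be sketched in a few lines of plain Python (no Storm or Spark involved; the event list and batch size are made up): the processing function runs once per record in one model and once per small batch in the other.

```python
# Record-at-a-time (Storm-like): one invocation of fn per incoming record.
def record_at_a_time(stream, fn):
    return [fn([r]) for r in stream]

# Micro-batching (Spark-Streaming-like): group records into small batches
# and hand each batch to the ordinary batch API.
def micro_batches(stream, fn, batch_size):
    out = []
    for i in range(0, len(stream), batch_size):
        out.append(fn(stream[i:i + batch_size]))  # one invocation per batch
    return out

events = [1, 2, 3, 4, 5, 6]
print(record_at_a_time(events, sum))  # [1, 2, 3, 4, 5, 6]
print(micro_batches(events, sum, 3))  # [6, 15]
```

Per-record dispatch gives lower latency; batching amortizes overhead and lets the same batch code serve streaming, which is why Spark Streaming integrates with the Core Spark API.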
96
GraphX
'GraphX' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
20
5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink. The capabilities grew with each generation:
• 1st generation: batch
• 2nd generation: batch, interactive
• 3rd generation: batch, interactive, near-real-time
• 4th generation: batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
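Two properties make RDDs fast in practice: transformations are recorded lazily until an action runs them, and cache() keeps a computed result in memory for reuse. The toy class below is a sketch of those two ideas only, not Spark's actual implementation; all names are made up.

```python
# Toy model of the RDD idea: lazy transformations, actions, and caching.
class ToyRDD:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops
        self.cached, self._memo = False, None

    def map(self, f):                 # transformation: only records f
        return ToyRDD(self.data, self.ops + (f,))

    def cache(self):                  # mark result for in-memory reuse
        self.cached = True
        return self

    def collect(self):                # action: actually runs the pipeline
        if self.cached and self._memo is not None:
            return self._memo
        out = list(self.data)
        for f in self.ops:
            out = [f(x) for x in out]
        if self.cached:
            self._memo = out
        return out

calls = 0
def times10(x):
    global calls
    calls += 1
    return x * 10

rdd = ToyRDD([1, 2, 3]).map(times10).cache()
print(calls)          # 0 -- lazy: building the pipeline ran nothing
print(rdd.collect())  # [10, 20, 30] -- the first action triggers the work
rdd.collect()         # second action is served from the cache
print(calls)          # 3 -- times10 was never re-run
```

Real RDDs add partitioning, lineage-based fault tolerance, and distributed execution on top of this skeleton.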
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala] user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python] user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | To data types and data sources is the same | To data types and data sources is the same | To data types and data sources is the same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | — | — | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
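The "reuse your mapper and reducer" point can be sketched in plain Python (the drivers below are hypothetical stand-ins, not Hadoop or Spark code): the same two functions feed both a MapReduce-style shuffle-and-reduce pass and a Spark-style flatMap/reduceByKey pipeline.

```python
# One mapper and one reducer, reused by two different execution styles.
from collections import defaultdict
from functools import reduce

def mapper(line):                 # emits (key, value) pairs
    return [(w, 1) for w in line.split()]

def reducer(a, b):                # combines two values for one key
    return a + b

lines = ["a b a", "b"]

# MapReduce-style driver: shuffle pairs by key, then reduce each group.
groups = defaultdict(list)
for line in lines:
    for k, v in mapper(line):
        groups[k].append(v)
mr_result = {k: reduce(reducer, vs) for k, vs in groups.items()}

# Spark-style driver: flatMap the mapper, then fold with the same reducer.
pairs = [kv for line in lines for kv in mapper(line)]
spark_result = {}
for k, v in pairs:
    spark_result[k] = reducer(spark_result[k], v) if k in spark_result else v

print(mr_result == spark_result, mr_result["a"])  # True 2
```

Because the business logic lives in `mapper` and `reducer`, only the driver changes during migration, which is what makes the transition incremental.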
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Spark integrates with open source tools across the Hadoop ecosystem, by service category:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory (httphortonworkscomblogddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase, without the need of using the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
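What "automatically infer the schema" means can be sketched with the stdlib json module: scan the records and collect a field-to-type map. Spark SQL's real inference is richer (nested structures, type widening across records); the records below are made up.

```python
# Toy schema inference over JSON lines: field name -> observed type.
import json

records = ['{"name": "ann", "age": 34}',
           '{"name": "bob", "age": 19, "city": "LA"}']

schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that the `city` field appears in only one record and is still picked up, which is why no up-front DDL is needed: the schema is the union of what the data contains.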
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
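The benefit of a columnar format like Parquet can be sketched without Parquet itself: with a row layout, a query that touches one column still walks every full row, while a columnar layout reads only that column's values. The counters below stand in for I/O; the dataset is made up.

```python
# Row layout vs columnar layout for a scan of the 'age' column only.
rows = [("ann", 34, "LA"), ("bob", 19, "SF"), ("cep", 52, "NY")]

# Row layout: scanning 'age' still touches every cell of every row.
row_cells_read = sum(len(r) for r in rows)            # 3 rows x 3 fields = 9

# Columnar layout: the same scan reads just the 'age' column's cells.
columns = {"name": [r[0] for r in rows],
           "age":  [r[1] for r in rows],
           "city": [r[2] for r in rows]}
col_cells_read = len(columns["age"])                  # 3

print(row_cells_read, col_cells_read)  # 9 3
```

Real Parquet adds per-column compression and encoding on top of this layout, which compounds the saving for wide tables and analytical queries.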
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at, rather than choosing one over the other.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
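The cluster-manager choice surfaces in spark-submit's --master flag. A configuration sketch with placeholder hosts and a hypothetical app.py (Spark 1.x-era syntax):

```shell
# Same application, different cluster managers (hosts/paths are placeholders).
spark-submit --master local[4]           app.py   # local threads, no cluster
spark-submit --master spark://host:7077  app.py   # Spark standalone
spark-submit --master mesos://host:5050  app.py   # Apache Mesos
spark-submit --master yarn-cluster       app.py   # Hadoop YARN (requires a Hadoop cluster)
```

Only the master URL changes; the application code stays the same across deployments.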
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives

Component          Hadoop Ecosystem   Spark Ecosystem
Storage            HDFS               Tachyon
Resource Manager   YARN               Mesos
Tools              Pig                Spark native API
                   Hive               Spark SQL
                   Mahout             MLlib
                   Storm              Spark Streaming
                   Giraph             GraphX
                   HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                      Storm                           Spark Streaming
Processing model              Record at a time                Mini batches
Latency                       Sub-second                      Few seconds
Fault tolerance               At least once                   Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available                   Core Spark API
Supported languages           Any programming language        Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIsbull MapReduce in Java is like assembly code of Big
Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User-Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                   Tez                        Spark
License           Open Source Apache 2.0,     Open Source Apache 2.0,    Open Source Apache 2.0,
                  version 2.x                 version 0.x                version 1.x
Processing model  On-disk (disk-based         On-disk, Batch,            In-memory, On-disk, Batch,
                  parallelization), Batch     Interactive                Interactive, Streaming
                                                                         (near real-time)
Written in        Java                        Java                       Scala
API               [Java, Python, Scala],      Java, [ISV/Engine/Tool     [Scala, Java, Python],
                  user-facing                 builder]                   user-facing
Libraries         None, separate tools        None                       [Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                      Tez                          Spark
Installation      Bound to Hadoop                Bound to Hadoop              Isn't bound to Hadoop
Ease of use       Difficult to program, needs    Difficult to program; no     Easy to program, no need for
                  abstractions; no interactive   interactive mode except      abstractions; interactive mode
                  mode except Hive, Pig          Hive, Pig
Compatibility     Same for data types and        Same                         Same
                  data sources
YARN integration  YARN application               Ground-up YARN application   Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce                  Tez                        Spark
Deployment   YARN                       YARN                       [Standalone, YARN, SIMR, Mesos, …]
Performance                                                        Good performance when data fits into
                                                                   memory; performance degradation otherwise
Security     More features and          More features and          Still in its infancy;
             projects                   projects                   partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 110)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
bull YARN: Yet Another Resource Negotiator (an implicit nod to Mesos, the original resource negotiator)
bull Integration is still improving; see the open YARN-related issues in the Spark JIRA (query: project = SPARK AND summary ~ yarn AND status = OPEN, ordered by priority): https://issues.apache.org/jira/
bull Some of the open issues are critical ones
bull Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
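A minimal spark-submit invocation against YARN, using the Spark 1.x master syntax (the class and jar names are placeholders):

```shell
# Submit to YARN in cluster mode (the driver runs inside the YARN cluster)
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```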
52
3 Integration
bull Spark SQL provides built-in support for Hive tables:
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
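Querying an existing Hive table from Spark SQL can be sketched with a HiveContext (Spark 1.2-era API; hive-site.xml must be on the classpath, and the table name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
// HiveContext reads hive-site.xml and talks to the Hive metastore
val hiveCtx = new HiveContext(sc)
// Query a Hive table; the result comes back as an RDD of Rows
val sales = hiveCtx.sql("SELECT product, SUM(amount) FROM sales GROUP BY product")
sales.collect().foreach(println)
```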
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
bull Spark Streaming integrates natively with Kafka; see the Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
bull Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
bull 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
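A minimal receiver-based Kafka stream, per the integration guide (a sketch on the Spark 1.2-era API; the ZooKeeper address, consumer group, and topic name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
// Subscribe to the "events" topic with one receiver thread
val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
  .map(_._2) // drop the Kafka message key, keep the payload
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
```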
55
3 Integration
bull Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
bull Spark Streaming integrates natively with Flume; there are two approaches:
bull Approach 1: Flume-style push-based approach
bull Approach 2 (experimental): pull-based approach using a custom sink
bull Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
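The push-based approach (Approach 1) can be sketched like this, assuming the spark-streaming-flume artifact is on the classpath and a Flume agent is configured with an Avro sink pointing at the given host and port:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(new SparkConf().setAppName("FlumeIngest"), Seconds(10))
// Spark acts as an Avro receiver that the Flume agent pushes events to
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 44444)
flumeStream.map(e => new String(e.event.getBody.array())).print()
ssc.start()
ssc.awaitTermination()
```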
56
3 Integration
bull Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame
bull An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
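Schema inference over JSON can be sketched as follows (Spark 1.2-era SQLContext API; the file path and field names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonSql"))
val sqlCtx = new SQLContext(sc)
// Infer the schema directly from the JSON records: no DDL needed
val people = sqlCtx.jsonFile("people.json")
people.printSchema()
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```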
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
bull An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
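Round-tripping structured data through Parquet might look like this (a sketch on the 1.2-era API; paths are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetDemo"))
val sqlCtx = new SQLContext(sc)
// Load some structured data (here JSON) and persist it as Parquet
val events = sqlCtx.jsonFile("events.json")
events.saveAsParquetFile("events.parquet")
// Read the Parquet files back; the schema is preserved in the files
val stored = sqlCtx.parquetFile("events.parquet")
stored.registerTempTable("events")
sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
```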
58
3 Integration
bull spark-avro: a library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
bull Problem:
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
bull Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
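The elasticsearch-hadoop Spark support can be sketched as follows (assuming the elasticsearch-hadoop jar and a local Elasticsearch node; the index/type names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs/esRDD to RDDs and SparkContext

val conf = new SparkConf().setAppName("EsDemo")
  .set("es.nodes", "localhost") // where Elasticsearch is running
val sc = new SparkContext(conf)
// Any RDD whose elements translate to documents can be indexed
val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark and ES", "views" -> 10),
  Map("title" -> "Hadoop and ES", "views" -> 5)))
docs.saveToEs("articles/posts")
// An index can also be read back as an RDD of (id, document) pairs
val fromEs = sc.esRDD("articles/posts")
println(fromEs.count())
```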
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer:
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (object store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
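The same application can target different cluster managers just by changing the master URL (host names, ports, and the jar name below are placeholders):

```shell
# Local mode with 4 worker threads
spark-submit --master local[4] my-app.jar
# Standalone Spark cluster
spark-submit --master spark://master-host:7077 my-app.jar
# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my-app.jar
# Hadoop YARN (Spark 1.x syntax)
spark-submit --master yarn-cluster my-app.jar
```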
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component → Spark ecosystem counterpart
bull HDFS → Tachyon
bull YARN → Mesos
Tools:
bull Pig → Spark native API
bull Hive → Spark SQL
bull Mahout → MLlib
bull Storm → Spark Streaming
bull Giraph → GraphX
bull HUE → Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
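In Spark 1.x, RDD blocks can be kept off-heap in Tachyon via the OFF_HEAP storage level (a sketch; the Tachyon master URL and input path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("TachyonDemo")
  // Tell Spark where the Tachyon master lives (Spark 1.x setting)
  .set("spark.tachyonStore.url", "tachyon://master:19998")
val sc = new SparkContext(conf)
// OFF_HEAP stores the RDD blocks in Tachyon instead of the executor heap
val data = sc.textFile("hdfs:///logs").persist(StorageLevel.OFF_HEAP)
println(data.count())
```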
89
bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as the data center "OS":
bull Share a datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
bull 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark native API in Scala, Java, and Python
bull Interactive shells in Scala and Python
bull Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
bull ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
bull 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
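The interactive shell makes exploration a few lines of Scala (a sketch; the log file path is illustrative):

```scala
// In spark-shell, `sc` is already provided
val logs = sc.textFile("access.log")
// Keep only error lines and cache them in memory for repeated queries
val errors = logs.filter(_.contains("ERROR")).cache()
println(errors.count())
errors.take(5).foreach(println)
```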
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
22
1 Evolution of Programming APIs
bull MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
bull 1st generation (MapReduce): batch
bull 2nd generation (Tez): batch, interactive
bull 3rd generation (Spark): batch, interactive, near-real-time
bull 4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
bull This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
bull Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
bull There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics
25
1 Evolution
bull Tez: Hindi for "speed"
bull This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
bull Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
bull 'Spark' for lightning-fast speed
bull This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
bull Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
bull The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
bull Flink: German for "nimble, swift, speedy"
bull This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
bull Apache Flink (http://flink.apache.org) offers:
bull Batch and streaming in the same system
bull Beyond DAGs (cyclic operator graphs)
bull Powerful, expressive APIs
bull Inside-the-system iterations
bull Full Hadoop compatibility
bull Automatic, language-independent optimizer
bull 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
License | Open source, Apache 2.0; version 2.x | Open source, Apache 2.0; version 0.x | Open source, Apache 2.0; version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near-real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None; separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
bull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
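The first option, reusing mapper and reducer logic, can be sketched as plain functions handed to Spark's operators (word count as the running example; the Spark calls are shown in comments, and the bottom of the block is a local sanity check that needs no cluster):

```scala
// Mapper and reducer logic as ordinary, testable functions
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

def reducer(a: Int, b: Int): Int = a + b

// In Spark, the same functions plug straight into the RDD operators:
//   sc.textFile("hdfs:///input").flatMap(mapper).reduceByKey(reducer)
//     .saveAsTextFile("hdfs:///output")

// Local sanity check without a cluster
val counts = Seq("to be or", "not to be")
  .flatMap(mapper)
  .groupBy(_._1)
  .mapValues(_.map(_._2).reduce(reducer))
println(counts("to")) // 2
```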
33
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
bull Run Pig with the "-x spark" option for an easy migration without development effort
bull Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
bull Leverage new Spark-specific operators in Pig, such as Cache
bull Still leverage many existing Pig UDF libraries
bull Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
bull Fix outstanding issues and address additional Spark functionality through the community
bull 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
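With the "-x spark" option above, an existing script would be launched like this (the script name is a placeholder, and availability depends on the still-open PIG-4059 work):

```shell
# Run an unmodified Pig script on the Spark execution engine
pig -x spark wordcount.pig
```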
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
bull Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
bull Performance benefits, especially for Hive queries involving multiple reducer stages
bull Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
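Switching an existing Hive session to the Spark engine is a one-line setting (the table and query below are illustrative):

```sql
-- Choose Spark instead of MapReduce or Tez for this session
set hive.execution.engine=spark;
-- Existing queries then run unchanged on the new engine
SELECT product, COUNT(*) FROM sales GROUP BY product;
```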
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
bull Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memoryThis allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 30 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector. httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (the "yet another" is an implicit reference to Mesos as an earlier resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some open issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883). httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
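The schema-inference idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Spark SQL's actual implementation; the infer_schema helper and its "widen conflicting types to string" rule are made up for the sketch:

```python
import json

def infer_type(value):
    """Map a JSON value to a coarse schema type (illustrative only)."""
    if isinstance(value, bool):
        return "boolean"          # check bool before int: bool is an int subclass
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"

def infer_schema(json_lines):
    """Merge the fields seen across all records into one schema."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            t = infer_type(value)
            if key in schema and schema[key] != t:
                schema[key] = "string"   # conflicting types widen to string
            else:
                schema[key] = t
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28, "city": "LA"}',
]
print(infer_schema(lines))  # {'name': 'string', 'age': 'integer', 'city': 'string'}
```

Note how the "city" field, present in only one record, still ends up in the merged schema, which is the behavior that makes schemaless JSON queryable.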
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrative example of integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
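Why a columnar format helps analytical queries can be sketched with plain Python lists. This is a conceptual illustration only; Parquet's actual on-disk format adds row groups, encodings and compression on top of the column-per-field idea:

```python
# Hypothetical records: (name, age, city) rows.
rows = [("alice", 34, "LA"), ("bob", 28, "SF"), ("carol", 41, "LA")]

# Row layout keeps each record together; a columnar layout (the Parquet
# idea) keeps each column's values together instead.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# An aggregate over one column touches only that column's values,
# skipping "name" and "city" entirely. That selective scan is where
# columnar formats win for analytical queries.
avg_age = sum(columns["age"]) / len(columns["age"])
```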
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
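The "content translated into documents" requirement amounts to mapping each record to a JSON object. A minimal plain-Python sketch of that translation step (the to_document helper is hypothetical; elasticsearch-hadoop handles the actual indexing into the cluster):

```python
import json

# Hypothetical RDD contents: plain (name, age) tuples.
records = [("alice", 34), ("bob", 28)]

def to_document(record):
    """Translate one record into an indexable document (a JSON object)."""
    name, age = record
    return {"name": name, "age": age}

# One JSON document per record is what gets shipped to Elasticsearch.
docs = [json.dumps(to_document(r)) for r in records]
```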
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
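The Data << RAM point can be illustrated with a plain-Python sketch of what caching parsed data buys on repeated passes. This imitates the effect of RDD.cache(); it is not Spark code, and the tiny JSON dataset is made up for the sketch:

```python
import json

# Hypothetical raw input: small JSON lines that fit comfortably in memory.
raw = ['{"v": %d}' % i for i in range(5)]

parse_count = 0

def parse(line):
    global parse_count
    parse_count += 1
    return json.loads(line)

# Without caching, every pass over the data pays the parse cost again.
total = sum(parse(l)["v"] for l in raw)
maximum = max(parse(l)["v"] for l in raw)
uncached_parses = parse_count   # two passes over five records: 10 parses

# Caching the parsed records (what RDD.cache() enables in Spark) lets
# every later pass reuse the in-memory result.
parse_count = 0
cached = [parse(l) for l in raw]            # parsed once, kept in RAM
total = sum(r["v"] for r in cached)
maximum = max(r["v"] for r in cached)
cached_parses = parse_count                 # one pass: 5 parses
```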
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution.
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
xPatterns
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
BlueData
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
Guavus
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
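The conciseness argument can be illustrated outside Spark with plain Python, where local lists stand in for distributed collections. This is only a stylistic sketch of loop-based versus lambda-based code, not Spark API usage:

```python
# Loop-based style: explicit and verbose.
evens_squared = []
for n in range(10):
    if n % 2 == 0:
        evens_squared.append(n * n)

# Concise functional style: the flavor that lambdas bring to Spark's
# APIs (think filter(...) followed by map(...) over an RDD).
squares = list(map(lambda n: n * n,
                   filter(lambda n: n % 2 == 0, range(10))))

assert squares == evens_squared
```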
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
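The mix-and-match workflow (a declarative SQL step followed by imperative post-processing) can be sketched with Python's built-in sqlite3 standing in for Spark SQL. The hits table, its values and the latency threshold are made up for the illustration:

```python
import sqlite3

# Hypothetical page-hit data: (url, response time in ms).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("/a", 150), ("/b", 340), ("/a", 90)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT url, AVG(ms) AS avg_ms FROM hits GROUP BY url ORDER BY url"
).fetchall()

# Imperative step: post-process the result set in ordinary code.
slow = [url for url, avg_ms in rows if avg_ms > 100]
```

The same back-and-forth is what Spark SQL enables at cluster scale: run SQL over structured data, then hand the result to programmatic transformations (or to MLlib) in the same application.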
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
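The mini-batch model in the comparison can be sketched in a few lines of plain Python. This is a conceptual illustration of how Spark Streaming discretizes a stream into small batches, not the actual DStream implementation (which batches by time interval rather than by count):

```python
def mini_batches(events, batch_size):
    """Discretize a stream of events into small batches, the way
    Spark Streaming turns a live stream into a sequence of RDDs."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:              # flush the final partial batch
        yield batch

# Record-at-a-time (Storm-style) processing handles each event alone,
# giving sub-second latency; mini-batching trades a few seconds of
# latency for batch throughput and per-batch exactly-once semantics.
batches = list(mini_batches(range(7), 3))
```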
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
23
1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007 MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop Now we have in addition to MapReduce v2 Tez Spark and Flink
bull Batch bull Batchbull Interactive
bull Batchbull Interactivebull Near-Real
time
bull Batchbull Interactivebull Real-Timebull Iterative
bull 1st Generation
bull 2nd
Generationbull 3rd
Generationbull 4th
Generation
24
1 Evolutionbull This is how Hadoop MapReduce is branding itself ldquoA YARN-based
system for parallel processing of large data sets httphadoopapacheorg
bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job
bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics
25
1 Evolutionbull Tez Hindi for ldquospeedrdquobull This is how Apache Tez is branding itself ldquoThe Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data It is currently built atop YARNrdquo
Source httptezapacheorg
bull Apachetrade Tez is an extensible framework for building high performance batch and interactive data processing applications coordinated by YARN in Apache Hadoop
26
1 Evolution bull lsquoSparkrsquo for lightning fast speedbull This is how Apache Spark is branding itself ldquoApache Sparktrade is a fast and general engine for large-scale data processingrdquo httpssparkapacheorg
bull Apache Spark is a general purpose cluster computing framework its execution model supports wide variety of use cases batch interactive near-real time
bull The rapid in-memory processing of resilient distributed datasets (RDDs) is the ldquocore capabilityrdquo of Apache Spark
27
1 Evolution Apache Flinkbull Flink German for ldquonimble swift speedyrdquobull This is how Apache Flink is branding itself ldquoFast and
reliable large-scale data processing enginerdquobull Apache Flink httpflinkapacheorg offers
bull Batch and Streaming in the same systembull Beyond DAGs (Cyclic operator graphs)bull Powerful expressive APIsbull Inside-the-system iterationsbull Full Hadoop compatibility bull Automatic language independent optimizer
bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs SparkCriteria
License Open SourceApache 20 version 2x
Open SourceApache 20 version 0x
Open SourceApache 20 version 1x
Processing Model
On-Disk (Disk- based parallelization) Batch
On-Disk Batch Interactive
In-Memory On-Disk Batch Interactive Streaming (Near Real-Time)
Language written in
Java Java Scala
API [Java Python Scala] User-Facing
Java[ ISVEngineTool builder]
[Scala Java Python] User-Facing
Libraries None separate tools None [Spark Core Spark Streaming Spark SQL MLlib GraphX]
29
Hadoop MapReduce vs Tez vs SparkCriteria
Installation Bound to Hadoop Bound to Hadoop Isnrsquot bound to Hadoop
Ease of Use Difficult to program needs abstractions
No Interactive mode except Hive Pig
Difficult to program
No Interactive mode except Hive Pig
Easy to program no need of abstractionsInteractive mode
Compatibility
to data types and data sources is same
to data types and data sources is same
to data types and data sources is same
YARN integration
YARN application Ground up YARN application
Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs SparkCriteria
Deployment YARN YARN [Standalone YARN SIMR Mesos hellip]
Performance - Good performance when data fits into memory
- performance degradation otherwise
Security More features and projects
More features and projects
Still in its infancy
Partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transitionbull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine1 You can often reuse your mapper and
reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)bull Run Pig with ldquondashx sparkrdquo option for an easy migration
without development effortbull Speed up your existing pig scripts on Spark ( Query
Logical Plan Physical Pan)bull Leverage new Spark specific operators in Pig such as
Cachebull Still leverage many existing Pig UDF librariesbull Pig on Spark Umbrella Jira (Status Passed end-to-end test
cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059bull Fix outstanding issues and address additional Spark functionality
through the community
bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 110)
bull New alternative to using MapReduce or Tez hivegt set hiveexecutionengine=sparkbull Help existing Hive applications running on
MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
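Along the lines of the HBaseTest.scala example cited above, a minimal read sketch (the table name and configuration are illustrative, and an HBase cluster is assumed to be reachable):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))

// Point the standard Hadoop TableInputFormat at an HBase table (placeholder name).
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// newAPIHadoopRDD turns any Hadoop InputFormat into an RDD of (key, value) pairs.
val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(s"Rows in table: ${hBaseRDD.count()}")
```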
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
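A minimal sketch with the DataStax Spark Cassandra Connector on the classpath (the host, keyspace and table names are illustrative):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Connection host, keyspace and table are placeholders.
val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as a Spark RDD ...
val words = sc.cassandraTable("test_ks", "words")
println(words.count())

// ... and write an RDD of tuples back to Cassandra.
sc.parallelize(Seq(("spark", 10), ("hadoop", 7)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```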
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
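A short sketch of the Hive integration (assuming an existing SparkContext `sc` and a Hive metastore; the table and columns are illustrative):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads the Hive metastore, so existing tables are queryable as-is.
val hiveCtx = new HiveContext(sc)

// The result is a SchemaRDD (renamed DataFrame from Spark 1.3).
val frequent = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")

frequent.collect().foreach(println)
```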
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
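A minimal receiver-based sketch from the integration guide's pattern (ZooKeeper address, consumer group and topic map are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// 10-second micro-batches over a Kafka topic.
val ssc = new StreamingContext(sc, Seconds(10))

val messages = KafkaUtils
  .createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))
  .map(_._2)               // drop the key, keep the message payload

messages.count().print()   // messages received per batch

ssc.start()
ssc.awaitTermination()
```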
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
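A small sketch of the schema-inference workflow (the file path is illustrative; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// The schema is inferred automatically from the JSON records -- no DDL needed.
val people = sqlCtx.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register and query like any other table.
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect()
```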
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
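In the same spirit as that example, a self-contained sketch that round-trips a dataset through Parquet (paths are illustrative; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
val people = sqlCtx.jsonFile("hdfs:///data/people.json")

// Write the SchemaRDD out in columnar Parquet format ...
people.saveAsParquetFile("hdfs:///data/people.parquet")

// ... and read it back, schema preserved, ready for SQL.
val parquetPeople = sqlCtx.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlCtx.sql("SELECT name FROM parquet_people").collect()
```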
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
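A minimal sketch with elasticsearch-hadoop on the classpath (the node address and the index/type name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD to RDDs and SparkContext

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Index two documents ...
val docs = Seq(Map("title" -> "spark"), Map("title" -> "hadoop"))
sc.makeRDD(docs).saveToEs("library/books")

// ... and read the index back as an RDD of documents.
val fromEs = sc.esRDD("library/books")
println(fromEs.count())
```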
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component and its Spark ecosystem counterpart:
HDFS: Tachyon
YARN: Mesos
Pig: Spark native API
Hive: Spark SQL
Mahout: MLlib
Storm: Spark Streaming
Giraph: GraphX
HUE: Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Resource sharing: YARN: yes; Mesos: yes
Written in: YARN: Java; Mesos: C++
Scheduling: YARN: memory only; Mesos: CPU and memory
Running tasks: YARN: Unix processes; Mesos: Linux container groups
Requests: YARN: specific requests and locality preference; Mesos: more generic, but more coding for writing frameworks
Maturity: YARN: less mature; Mesos: relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Processing model: Storm: record at a time; Spark Streaming: mini batches
Latency: Storm: sub-second; Spark Streaming: few seconds
Fault tolerance (every record processed): Storm: at least once (may be duplicates); Spark Streaming: exactly once
Batch framework integration: Storm: not available; Spark Streaming: Core Spark API
Supported languages: Storm: any programming language; Spark Streaming: Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
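A tiny sketch of that core capability (the numbers are illustrative; an existing SparkContext `sc` is assumed):

```scala
// Build an RDD, transform it, and ask Spark to keep the result in memory.
val evens = sc.parallelize(1 to 1000000)
  .filter(_ % 2 == 0)
  .cache()                 // mark the RDD for in-memory reuse

println(evens.count())     // first action computes and caches the RDD
println(evens.sum())       // later actions reuse the in-memory data
```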
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | Hadoop MapReduce                          | Tez                                  | Spark
License             | Open Source, Apache 2.0, version 2.x      | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive        | In-memory, on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java                                      | Java                                 | Scala
API                 | [Java, Python, Scala] User-facing         | [Java] ISV/Engine/Tool builder       | [Scala, Java, Python] User-facing
Libraries           | None, separate tools                      | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria         | Hadoop MapReduce                                                              | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                               | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig  | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                          | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                              | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark" http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
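As a sketch of that migration path (the table and query below are hypothetical, and this assumes a Hive build with Spark configured), switching engines is a per-session setting rather than a code change:

```sql
-- Hypothetical session: same HiveQL, different execution engine
set hive.execution.engine=spark;   -- run the query on Spark
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
set hive.execution.engine=mr;      -- fall back to classic MapReduce
```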
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services and the open source tools that integrate with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration: HBase
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration: Cassandra
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration: Cassandra
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration: MongoDB
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration: MongoDB
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration: Neo4j
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving (open SPARK JIRA issues mentioning YARN): https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration: Flume
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration: JSON
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
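A minimal sketch of the "no DDL" point (records and path are invented; the commented lines are the Spark 1.2/1.3-era API, while the runnable part only builds the self-describing input that Spark infers a schema from):

```python
import json
import os
import tempfile

# Build a small JSON-lines dataset: one self-describing object per line
records = [{"name": "alice", "age": 34}, {"name": "bob", "age": 28}]
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# In Spark SQL there is no DDL step; the schema comes from the data itself:
#   people = sqlContext.jsonFile(path)    # SchemaRDD (DataFrame in Spark 1.3)
#   people.registerTempTable("people")
#   sqlContext.sql("SELECT name FROM people WHERE age > 30")

# The field names Spark would discover for this dataset:
inferred_fields = sorted({key for r in records for key in r})
print(inferred_fields)
```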
57
3. Integration: Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration: Avro
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration: Hue
• Hue is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN, references
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
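Since the storage choice above surfaces in Spark mostly as a URI scheme, swapping file systems is typically a one-line change. A sketch (bucket names, hosts and paths below are made up, and each scheme assumes the matching Hadoop connector is on the classpath):

```
sc.textFile("s3n://my-bucket/logs/")          # Amazon S3
sc.textFile("maprfs:///datasets/input")       # MapR-FS
sc.textFile("swift://container.spark/input")  # OpenStack Swift
```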
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
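From the application's point of view, the deployments above differ mainly in the --master URL handed to spark-submit. A sketch (hostnames, ports and the app.py script are made up):

```shell
spark-submit --master local[4]          app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.py   # standalone cluster
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn-client       app.py   # with Hadoop: YARN
```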
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution.
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE (DataStax Enterprise)
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• "Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop ecosystem | Spark ecosystem
File system      | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook, ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
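In practice, the "without any code change" claim boils down to the filesystem URI. A sketch (hostnames and paths are made up; 19998 is Tachyon's usual master port):

```
sc.textFile("hdfs://namenode:9000/input")    # read from HDFS
sc.textFile("tachyon://master:19998/input")  # same job, memory-speed storage
```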
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
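To illustrate the conciseness point, the same lambda style carries over between Spark's APIs and plain Python (the commented line is a hypothetical equivalent Spark chain; the runnable part needs no cluster):

```python
# Equivalent Spark pipeline (sketch):
#   sc.parallelize(range(10)).filter(lambda x: x % 2 == 0) \
#     .map(lambda x: x * x).collect()

# Plain Python with the same filter/map lambdas:
squares = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, range(10))))
print(squares)  # [0, 4, 16, 36, 64]
```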
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing Model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
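The mini-batch model in the comparison above can be seen in a short Spark Streaming sketch (the socket source is hypothetical; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // 2-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)    // word counts within each 2-second batch
  .print()
ssc.start()
ssc.awaitTermination()
```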
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
25
1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | Hadoop MapReduce                            | Tez                                  | Spark
License             | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive          | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala], User-Facing          | Java, [ISV/Engine/Tool builder]      | [Scala, Java, Python], User-Facing
Libraries           | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | Hadoop MapReduce                                                                | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                 | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is same                                          | to data types and data sources is same                     | to data types and data sources is same
YARN integration | YARN application                                                                | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
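A sketch of point 1, reusing word-count mapper and reducer logic as plain Scala functions inside Spark (assumes an existing SparkContext `sc`; the path is hypothetical):

```scala
// Logic that was once a Hadoop Mapper and Reducer, as plain functions:
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1))

def reducer(a: Int, b: Int): Int = a + b

// Called directly from Spark transformations:
val counts = sc.textFile("hdfs:///logs")   // hypothetical input
  .flatMap(mapper)
  .reduceByKey(reducer)
```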
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the Hadoop ecosystem, service by service:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
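A sketch of the storage-agnostic read and write paths mentioned above (all URIs are hypothetical; assumes an existing SparkContext `sc`):

```scala
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events")  // HDFS
val fromS3    = sc.textFile("s3n://my-bucket/events")            // Amazon S3
val fromLocal = sc.textFile("file:///tmp/events.log")            // local FS

// The same API writes back to any supported store:
fromHdfs.union(fromS3).union(fromLocal)
  .saveAsTextFile("hdfs://namenode:8020/data/merged")
```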
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore. Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
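Following the HBaseTest.scala pattern, a minimal newAPIHadoopRDD sketch (the table name is hypothetical; assumes an existing SparkContext `sc` and HBase client jars on the classpath):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table

val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hbaseRDD.count())   // number of rows scanned from the table
```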
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
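A sketch of the Spark Cassandra Connector API described above (keyspace, table, and column names are hypothetical; assumes an existing SparkContext `sc` configured with a Cassandra host):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD:
val words = sc.cassandraTable("test_ks", "words")
println(words.first())

// Write a Spark RDD back to a Cassandra table:
sc.parallelize(Seq(("spark", 10), ("cassandra", 5)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```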
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental)
  • GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • PART 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  • PART 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  • PART 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some open issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
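The Hive support above can be sketched with the HiveContext from the Spark documentation (the sample table and file follow the docs' kv1.txt example; assumes an existing SparkContext `sc` and a Hive-enabled build):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql(
  "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL and return RDDs of Rows:
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
```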
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
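A sketch of the receiver-based Kafka integration from the guide above (ZooKeeper address, consumer group, and topic are hypothetical; Spark 1.2-era API, assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

stream.map(_._2)   // keep the message payload, drop the key
  .count()
  .print()         // messages received per 2-second batch
ssc.start()
```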
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
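Schema inference on JSON, sketched with the 1.2-era SQLContext API (the file name is hypothetical; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val people = sqlContext.jsonFile("people.json")  // schema inferred automatically
people.printSchema()                             // no DDL was needed
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```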
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
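The Parquet read and write path, sketched with the same 1.2-era API (file names are hypothetical; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val events = sqlContext.jsonFile("events.json")   // any SchemaRDD works here
events.saveAsParquetFile("events.parquet")        // write the RDD out as Parquet

val parquetEvents = sqlContext.parquetFile("events.parquet")
parquetEvents.registerTempTable("events")         // queryable again via SQL
```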
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
  Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
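The deployment choice surfaces in application code only as the master URL; a minimal sketch (host names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick exactly one master URL, matching the chosen deployment:
val conf = new SparkConf().setAppName("demo")
  .setMaster("local[4]")             // local mode with 4 threads
  // .setMaster("spark://host:7077") // standalone cluster
  // .setMaster("mesos://host:5050") // Apache Mesos
  // .setMaster("yarn-client")       // Hadoop YARN
val sc = new SparkContext(conf)
```

In practice the master is usually left out of the code and passed via spark-submit's --master flag instead, keeping the application deployment-agnostic.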
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
          | Hadoop Ecosystem | Spark Ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  - Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  - Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and Memory
Running tasks     Unix processes                              Linux Container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
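To make the conciseness point concrete, here is a hypothetical word count in plain Python using the same flatMap/map/reduce idiom that the native API expresses over RDDs (no Spark required; the data is made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to be is to do"]

# "flatMap": split each line into words, then flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# "map" to (word, 1) plus "reduceByKey" collapses to a per-word count
counts = Counter(words)

print(counts["to"])  # 4
```

With PySpark, the equivalent pipeline would chain `flatMap`, `map` and `reduceByKey` calls on an RDD; in Java 8, the lambdas make that chain read almost identically to the Scala version.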
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
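The "mini batches" row is the crux of the comparison: Spark Streaming discretizes a stream into small time-sliced batches instead of handling each record individually like Storm. A toy sketch of that batching discipline in plain Python (the timestamps and interval are invented for illustration; this is not the Spark Streaming API):

```python
def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into time-sliced batches,
    the way Spark Streaming discretizes a stream into small batches."""
    batches = {}
    for ts, value in records:
        # Each record falls into the batch covering its time slice
        batch_id = int(ts // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Each emitted batch corresponds to one RDD in Spark Streaming's DStream abstraction; end-to-end latency is therefore bounded below by the batch interval, which is why the table lists "few seconds" rather than "sub-second".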
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your own deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
26
1. Evolution: Apache Spark
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
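To make the RDD "core capability" concrete: an RDD records a lineage of lazy transformations that only execute when an action is called, and the computed result can be cached in memory for reuse. A hypothetical toy model of that idea in plain Python (not the real Spark API):

```python
class ToyRDD:
    """Minimal stand-in for an RDD: lazy transformations, cached results."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []   # lineage: ordered list of transformations
        self._cache = None

    def map(self, fn):
        # Transformation: just extend the lineage; compute nothing yet
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the lineage once, then keep the result in memory
        if self._cache is None:
            out = list(self._data)
            for kind, fn in self._ops:
                out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
            self._cache = out
        return self._cache

rdd = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the lineage additionally provides resilience: a lost partition can be recomputed from its ancestors rather than restored from replicas.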
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  - Batch and streaming in the same system
  - Beyond DAGs (cyclic operator graphs)
  - Powerful, expressive APIs
  - Inside-the-system iterations
  - Full Hadoop compatibility
  - Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model: On-disk (disk-based parallelization), batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility: Same for all three with respect to data types and data sources
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance: – | – | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as their execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
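Point 1 above can be sketched in plain Python: a classic word-count mapper and reducer, written once, slot directly into a Spark-style flatMap/group/reduce pipeline (simulated here without Spark; the data is made up):

```python
from itertools import groupby

# Classic MapReduce-style functions, written once
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

# Reused in a Spark-style pipeline: flatMap -> shuffle -> reduce per key
lines = ["spark and hadoop", "spark with hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]   # flatMap(mapper)
pairs.sort(key=lambda kv: kv[0])                        # the "shuffle"
result = dict(reducer(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'and': 1, 'hadoop': 2, 'spark': 2, 'with': 1}
```

In actual Spark code the same `mapper` and `reducer` would be passed to `flatMap` and a grouped reduction on an RDD, which is the reuse path the slide describes.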
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  - Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  - Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the following service categories:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  - Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  - Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  - Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  - Import relational data from Hive tables
  - Run SQL queries over imported data
  - Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  - Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
  - Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  - Approach 1: Flume-style push-based approach
  - Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at the JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
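The idea behind automatic schema inference can be illustrated with a toy Python sketch (this is not Spark's implementation): scan the JSON records, collect each field's name and value type, and merge them into one schema. A real engine also reconciles conflicting types; this toy simply keeps the last type seen.

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of field names -> value type names."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "age": 29, "city": "LA"}']
print(infer_schema(records))
# {'name': 'str', 'age': 'int', 'city': 'str'}
```

This is what lets Spark SQL skip the DDL step: the schema falls out of the data itself.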
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  - Import relational data from Parquet files
  - Run SQL queries over imported data
  - Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
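Why a columnar format helps: a query that touches only some columns can skip the rest entirely. A minimal, hypothetical illustration of row vs column layout in plain Python (Parquet layers encodings, compression and metadata on top of this basic idea):

```python
# Row-oriented: each record stored together, so a query scans whole rows
rows = [
    {"name": "alice", "age": 34, "city": "LA"},
    {"name": "bob",   "age": 29, "city": "SF"},
]

# Column-oriented: one contiguous list per column, like a Parquet column chunk
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query touching only "age" reads just that column, skipping the others
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 31.5
```

Column pruning like this, plus per-column compression, is why columnar formats shine for analytical scans.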
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  - Problem:
    - Various inbound data sets
    - Data layout can change without notice
    - New data sets can be added without notice
  - Result:
    - Leverage Spark to dynamically split the data
    - Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  - Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  - Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  - Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  - MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (Object Store):
  - https://spark.apache.org/docs/latest/storage-openstack-swift.html
  - https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  - Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  - Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
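Whichever option is chosen, the cluster manager is selected at submit time through the master URL passed to spark-submit. A hypothetical sketch for Spark 1.x (the host names and application file are made up for illustration):

```shell
# Local mode: run with 4 worker threads on one machine
spark-submit --master local[4] my_app.py

# Standalone cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 my_app.py

# Hadoop YARN (HADOOP_CONF_DIR must point at the cluster configuration)
spark-submit --master yarn-cluster my_app.py
```

The application code stays the same across all four; only the master URL, and therefore the deployment, changes.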
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
HDFS | Tachyon
YARN | Mesos
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
• Mesos as Data Center "OS"
• Share datacenter between multiple cluster computing apps Provide new abstractions and services
• Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS…
• 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
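The fine-grained sharing idea above can be sketched in plain Python (a toy model, not the real Mesos offer protocol; all names and numbers are illustrative):

```python
# Toy model of Mesos-style fine-grained sharing: each scheduling round,
# the framework is offered whatever CPUs are currently idle, so a Spark
# job's parallelism can grow as other workloads release resources.

TOTAL_CPUS = 8

def fine_grained_rounds(other_usage_per_round):
    """Each round, the Spark job runs one task per idle CPU it is offered."""
    tasks_per_round = []
    for other in other_usage_per_round:
        idle = TOTAL_CPUS - other
        tasks_per_round.append(idle)  # one task per offered CPU
    return tasks_per_round

def coarse_grained_rounds(other_usage_per_round, reserved=4):
    """A static reservation runs the same task count regardless of idle CPUs."""
    return [reserved for _ in other_usage_per_round]

# Another tenant ramps down over 4 rounds, freeing CPUs.
other = [6, 4, 2, 0]
print(fine_grained_rounds(other))    # [2, 4, 6, 8] - picks up idle CPUs
print(coarse_grained_rounds(other))  # [4, 4, 4, 4] - stuck at reservation
```

The fine-grained job ends each round using every idle CPU, which is the "take advantage of idle resources" claim in the bullet above.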
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark Native API in Scala Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
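The "mix and match SQL and imperative APIs" idea can be sketched with Python's standard-library sqlite3 standing in for Spark SQL (an analogy only — Spark SQL runs over RDDs/SchemaRDDs, not SQLite; the table and data are made up):

```python
import sqlite3

# Analogy for Spark SQL's "unify SQL and sophisticated analysis" idea,
# using stdlib sqlite3 in place of a SchemaRDD (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("/home", 120), ("/home", 80), ("/docs", 300)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT page, AVG(ms) FROM hits GROUP BY page ORDER BY page").fetchall()

# Imperative step: post-process the SQL result with ordinary Python.
slow_pages = [page for page, avg_ms in rows if avg_ms > 150]
print(rows)        # [('/docs', 300.0), ('/home', 100.0)]
print(slow_pages)  # ['/docs']
```

The point is the shape of the workflow — declarative aggregation followed by imperative logic over the result — not the storage engine.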
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is File System Agnostic Bring Your Own Storage
2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in Non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
27
1 Evolution Apache Flink
• Flink German for "nimble swift speedy"
• This is how Apache Flink is branding itself "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers
• Batch and Streaming in the same system
• Beyond DAGs (Cyclic operator graphs)
• Powerful expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic language-independent optimizer
• 'Flink' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0 version 2.x | Open Source Apache 2.0 version 0.x | Open Source Apache 2.0 version 1.x
Processing Model | On-Disk (Disk-based parallelization) Batch | On-Disk Batch Interactive | In-Memory On-Disk Batch Interactive Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java Python Scala] User-Facing | Java [ISV / Engine / Tool builder] | [Scala Java Python] User-Facing
Libraries | None separate tools | None | [Spark Core Spark Streaming Spark SQL MLlib GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program needs abstractions No interactive mode except Hive Pig | Difficult to program No interactive mode except Hive Pig | Easy to program no need of abstractions Interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone YARN SIMR Mesos …]
Performance | – | – | Good performance when data fits into memory performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy Partial support
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine
1 You can often reuse your mapper and
reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
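The reuse point above can be sketched in plain Python: the same mapper and reducer functions drive both a MapReduce-style pipeline and a Spark-style flatMap/reduceByKey chain (the helpers are illustrative stand-ins, not the actual Spark API):

```python
from collections import defaultdict

# Plain-Python sketch of translating MapReduce word count to Spark style.
# The mapper and reducer functions are reused unchanged in both versions.

def mapper(line):                      # MapReduce-style map: emit (word, 1)
    return [(w, 1) for w in line.split()]

def reducer(a, b):                     # MapReduce-style reduce: sum counts
    return a + b

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                 # map + shuffle: group values by key
        for k, v in mapper(line):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():       # reduce per key
        acc = vs[0]
        for v in vs[1:]:
            acc = reducer(acc, v)
        out[k] = acc
    return out

def run_spark_style(lines):
    # rdd.flatMap(mapper).reduceByKey(reducer) expressed over a plain list
    pairs = [kv for line in lines for kv in mapper(line)]   # flatMap
    out = {}
    for k, v in pairs:                                      # reduceByKey
        out[k] = reducer(out[k], v) if k in out else v
    return out

lines = ["spark or hadoop", "spark and hadoop"]
assert run_mapreduce(lines) == run_spark_style(lines)
print(run_spark_style(lines))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

Both pipelines produce identical counts, which is why existing mapper/reducer functions can often be called from Spark with little change.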
33
2 Transition
3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
• Run Pig with "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query Logical Plan Physical Plan)
• Leverage new Spark-specific operators in Pig such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status Passed end-to-end test cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez hive> set hive.execution.engine=spark
• Help existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
• Performance benefits especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark February 11 2015 Szehon Ho Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (Expected in Mahout 1.0)
• Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell Interactive REPL shell for Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout scala and spark bindings Dmitriy Lyubimov April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) - MapReduce Spark H2O Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Hadoop ecosystem services with open source tool integration for Spark (logos shown on the original slide)
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra Integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra httptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
• Interesting blog on Using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
• YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 Spark-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark Integration is work in progress in 2015 to address new use cases
• Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume There are two approaches to this
• Approach 1 Flume-style Push-based Approach
• Approach 2 (Experimental) Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
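The schema-inference idea can be sketched in plain Python with the standard json module (illustrative only — Spark SQL's actual inference handles nesting, nulls and type widening far more thoroughly):

```python
import json

# Plain-Python sketch of the idea behind Spark SQL's JSON schema
# inference (not the Spark implementation): scan JSON records and
# derive a field -> type mapping instead of writing DDL up front.

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            # Crude conflict resolution: widen to 'str' on mismatched types.
            if field in schema and schema[field] != t:
                schema[field] = "str"
            else:
                schema[field] = t
    return schema

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28, "city": "Chicago"}',
]
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note the second record contributes a field the first lacks; the inferred schema is the union of all observed fields, which is the "no more DDL" convenience.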
57
3 Integration
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
• Built-in support in Spark SQL allows to
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
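The benefit of a columnar layout can be sketched in plain Python (illustrative only, not the Parquet file format itself): an analytical query over one column touches far fewer values when data is stored column-wise.

```python
# Plain-Python sketch of row vs columnar layout for analytics.
rows = [
    {"page": "/home", "ms": 120, "user": "a"},
    {"page": "/docs", "ms": 300, "user": "b"},
    {"page": "/home", "ms": 80,  "user": "c"},
]

# Row layout: computing avg(ms) scans every field of every record.
row_values_read = sum(len(r) for r in rows)          # 9 fields touched

# Columnar layout: the same query reads only the 'ms' column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_values_read = len(columns["ms"])                 # 3 fields touched

avg_ms = sum(columns["ms"]) / len(columns["ms"])
print(row_values_read, col_values_read)              # 9 3
```

Column-at-a-time storage also compresses better (similar values sit together), which is the other reason formats like Parquet suit Spark SQL workloads.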
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark Use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark Support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or… HDFS caching)
• Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity
• Data >> RAM Processing huge data volumes much bigger than cluster RAM Tez might be better since it is more "stream oriented" has a more mature shuffling implementation closer YARN integration
• Data << RAM Since Spark can cache parsed data in memory it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
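The "Data << RAM" point can be sketched in plain Python (a toy model, not Spark code): caching the parsed data pays the parse cost once, however many passes follow.

```python
# Toy model of why in-memory caching helps multi-pass jobs: count how
# many times the "expensive" parse step runs with and without a cache.

parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1          # stand-in for expensive parsing / disk I/O
    return int(line)

raw = ["1", "2", "3", "4"]

# Without caching: every pass re-parses the raw input (the way
# disk-oriented pipelines re-read data between stages).
parse_calls = 0
for _ in range(3):
    total = sum(parse(l) for l in raw)
uncached = parse_calls        # 12 parses for 3 passes over 4 records

# With caching: parse once, keep results in memory (the rdd.cache() idea).
parse_calls = 0
cached = [parse(l) for l in raw]
for _ in range(3):
    total = sum(cached)
with_cache = parse_calls      # 4 parses regardless of pass count
print(uncached, with_cache)   # 12 4
```

The gap grows linearly with the number of passes, which is why iterative and interactive workloads benefit most — provided the cached data actually fits in cluster memory.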
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS Hadoop Distributed File System Your 'Big Data' use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark
• Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
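In practice, the deployment options above boil down to the --master URL handed to spark-submit. Here is a minimal plain-Python sketch of the conventional master URL formats (host names, ports, and the jar name are illustrative placeholders, not values from this deck):

```python
# Conventional --master values for spark-submit; host/port values are
# illustrative defaults, not something to copy verbatim.
masters = {
    "local":      "local[*]",                  # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos":      "mesos://mesos-host:5050",
    "yarn":       "yarn-cluster",              # reads cluster info from HADOOP_CONF_DIR
}

def submit_command(mode, app_jar):
    """Build an illustrative spark-submit command line for a deployment mode."""
    return f"spark-submit --master {masters[mode]} {app_jar}"

cmd = submit_command("standalone", "my-app.jar")
# cmd == "spark-submit --master spark://master-host:7077 my-app.jar"
```

Only the master URL changes between deployments; the application code itself stays the same.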
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus (http://www.guavus.com) operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Component              Hadoop Ecosystem    Spark Ecosystem
Storage                HDFS                Tachyon
Resource Management    YARN                Mesos
Tools                  Pig                 Spark native API
                       Hive                Spark SQL
                       Mahout              MLlib
                       Storm               Spark Streaming
                       Giraph              GraphX
                       HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria           YARN                     Mesos
Resource sharing   Yes                      Yes
Written in         Java                     C++
Scheduling         Memory only              CPU and Memory
Running tasks      Unix processes           Linux Container groups
Requests           Specific requests and    More generic, but more coding
                   locality preference      for writing frameworks
Maturity           Less mature              Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                    Storm                      Spark Streaming
Processing model            Record at a time           Mini batches
Latency                     Sub-second                 Few seconds
Fault tolerance (every      At least once              Exactly once
record processed)           (may be duplicates)
Batch framework             Not available              Core Spark API
integration
Supported languages         Any programming language   Scala, Java, Python
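The "mini batches" row above is the key architectural difference: Spark Streaming discretizes a stream into small batches instead of handling one record at a time. A plain-Python sketch of that grouping idea (this is not Spark Streaming's actual API, just an illustration):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a record-at-a-time stream into mini batches, the way
    Spark Streaming discretizes a live stream into small RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # each batch is then processed as one unit

# Record-at-a-time (Storm-style) handles events one by one, giving
# sub-second latency; mini-batching adds a few seconds of latency but
# amortizes scheduling overhead and reuses the core Spark batch API.
events = ["e1", "e2", "e3", "e4", "e5"]
batches = list(micro_batches(events, 2))
# batches == [["e1", "e2"], ["e3", "e4"], ["e5"]]
```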
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria      Hadoop MapReduce         Tez                      Spark
License       Open Source,             Open Source,             Open Source,
              Apache 2.0,              Apache 2.0,              Apache 2.0,
              version 2.x              version 0.x              version 1.x
Processing    On-Disk (disk-based      On-Disk, Batch,          In-Memory, On-Disk,
Model         parallelization),        Interactive              Batch, Interactive,
              Batch                                             Streaming (Near Real-Time)
Language      Java                     Java                     Scala
written in
API           [Java, Python, Scala],   Java                     [Scala, Java, Python],
              User-Facing              [ISV/Engine/Tool         User-Facing
                                       builder]
Libraries     None, separate tools     None                     [Spark Core, Spark Streaming,
                                                                Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria       Hadoop MapReduce         Tez                      Spark
Installation   Bound to Hadoop          Bound to Hadoop          Isn't bound to Hadoop
Ease of Use    Difficult to program,    Difficult to program;    Easy to program, no need
               needs abstractions;      no interactive mode      of abstractions;
               no interactive mode      except Hive, Pig         interactive mode
               except Hive, Pig
Compatibility  Compatibility to data types and data sources is the same for all three
YARN           YARN application         Ground-up YARN           Spark is moving
integration                             application              towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria      Hadoop MapReduce        Tez                     Spark
Deployment    YARN                    YARN                    [Standalone, YARN,
                                                              SIMR, Mesos, …]
Performance   –                       –                       Good performance when data
                                                              fits into memory; performance
                                                              degradation otherwise
Security      More features and       More features and       Still in its infancy;
              projects                projects                partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
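As a toy illustration of point 1 (reusing existing mapper and reducer functions), here is a plain-Python sketch — not real MapReduce or Spark code — of a word-count mapper/reducer pair driven MapReduce-style; in Spark the very same functions could be reused in a flatMap/reduceByKey pipeline:

```python
from collections import defaultdict

# Mapper and reducer written once, MapReduce-style.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

def word_count(lines):
    """Drive the functions the way a MapReduce job would:
    map phase, shuffle (group by key), reduce phase.
    In Spark the identical functions can be reused, roughly:
    lines.flatMap(mapper).groupByKey() ... then reducer per key."""
    shuffled = defaultdict(list)
    for line in lines:                  # map phase
        for word, one in mapper(line):
            shuffled[word].append(one)  # shuffle: group by key
    return dict(reducer(w, c) for w, c in shuffled.items())  # reduce phase

counts = word_count(["spark and hadoop", "spark"])
# counts == {"spark": 2, "and": 1, "hadoop": 1}
```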
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark Bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3. Integration
Services from the Hadoop ecosystem that integrate with Spark (each paired with its open source tool):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, with no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
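Since the posts above center on PageRank over a graph, here is a minimal, generic PageRank iteration in plain Python (not GraphX or Neo4j code; the damping factor and iteration count are conventional but arbitrary choices):

```python
def pagerank(links, iterations=20, damping=0.85):
    """links: node -> list of outbound neighbors.
    Classic iterative PageRank; GraphX expresses the same computation
    over RDD-backed vertex and edge collections."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Everyone keeps a (1 - damping) base share...
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        # ...plus contributions split evenly across each node's out-links.
        for node, outs in links.items():
            for neighbor in outs:
                new_rank[neighbor] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# ranks sum to ~1.0; "c" (pointed to by both "a" and "b") ranks highest
```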
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
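The schema inference described above can be sketched in plain Python — a drastic simplification of what Spark SQL actually does, with made-up field names, but it shows the idea of deriving a schema by scanning records:

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON records,
    a (much simplified) version of Spark SQL's schema inference
    over a JSON dataset."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            # Crude merge rule: fall back to "str" on conflicting types
            # (Spark SQL's real type widening is more sophisticated).
            if field in schema and schema[field] != t:
                schema[field] = "str"
            else:
                schema.setdefault(field, t)
    return schema

schema = infer_schema(['{"name": "spark", "stars": 100}',
                       '{"name": "hadoop", "stars": 90, "tags": ["big-data"]}'])
# schema == {"name": "str", "stars": "int", "tags": "list"}
```

Note how the second record contributes a field ("tags") the first one lacked: the inferred schema is the union over all records, which is exactly why no up-front DDL is needed.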
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
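The row-to-columnar idea behind Parquet can be sketched in a few lines of Python (illustrative only; this is nothing like Parquet's actual encoding, which adds compression, encodings, and nested schemas):

```python
def to_columnar(rows):
    """Pivot row-oriented records into column-oriented storage,
    the core layout idea behind Parquet: all values of one column
    are stored together, so queries touching few columns read less
    data, and same-typed runs compress better."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"id": 1, "tool": "Pig"}, {"id": 2, "tool": "Hive"}]
cols = to_columnar(rows)
# cols == {"id": [1, 2], "tool": ["Pig", "Hive"]}
```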
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3:
  • http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS:
  • https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
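Each of the deployment modes above boils down to a different `--master` URL handed to spark-submit; a sketch with placeholder host names and application jar (host names and app.jar are hypothetical; the ports are the documented defaults):

```shell
spark-submit --master local[4]          app.jar   # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.jar   # standalone cluster manager
spark-submit --master mesos://host:5050 app.jar   # Apache Mesos
spark-submit --master yarn-cluster      app.jar   # Hadoop YARN (Spark 1.x syntax)
```

These are CLI templates rather than a runnable script; the application code itself stays the same across all four modes.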
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

             Hadoop ecosystem    Spark ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini batches
Latency                       Sub-second                 Few seconds
Fault tolerance (every        At least once              Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
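The "record at a time" vs. "mini batches" rows of the table can be sketched in plain Python, with no Storm or Spark required (the batch size and the doubling function are made up for illustration; real engines add scheduling, fault tolerance, and state on top of this skeleton):

```python
from typing import Callable, Iterable, List

def record_at_a_time(stream: Iterable[int], process: Callable[[int], int]) -> List[int]:
    """Storm-style: each record is handed to the processor as soon as it arrives."""
    return [process(r) for r in stream]

def mini_batches(stream: Iterable[int], batch_size: int,
                 process_batch: Callable[[List[int]], List[int]]) -> List[int]:
    """Spark Streaming-style: records are grouped into small batches first.
    Waiting for a batch to fill adds latency, but lets the engine reuse the
    same batch machinery it applies to data at rest."""
    out: List[int] = []
    batch: List[int] = []
    for r in stream:
        batch.append(r)
        if len(batch) == batch_size:
            out.extend(process_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.extend(process_batch(batch))
    return out

# Both models compute the same result; they differ in when work happens.
a = record_at_a_time(range(7), lambda r: r * 2)
b = mini_batches(range(7), 3, lambda batch: [r * 2 for r in batch])
assert a == b == [0, 2, 4, 6, 8, 10, 12]
```

This is also why Spark Streaming's latency is "few seconds" in the table: it is bounded below by the batch interval, while Storm pays only per-record overhead.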
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                       Spark
Installation      Bound to Hadoop           Bound to Hadoop           Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;     Easy to program, no need
                  needs abstractions;       no interactive mode       of abstractions;
                  no interactive mode       (except via Hive/Pig)     interactive mode
                  (except via Hive/Pig)
Compatibility     Support for data types and data sources is the same across all three
YARN integration  YARN application          Ground-up YARN            Spark is moving towards
                                            application               YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce                Tez                      Spark
Deployment   YARN                     YARN                     Standalone, YARN, SIMR, Mesos, ...
Performance  -                        -                        Good performance when data fits into
                                                               memory; performance degradation otherwise
Security     More features and        More features and        Still in its infancy;
             projects                 projects                 partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. See "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
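Point 1 above (reusing existing mapper and reducer functions unchanged) can be sketched with a toy word count in plain Python; in actual Spark code the same two functions would be handed to `flatMap` and `reduceByKey`:

```python
from functools import reduce
from itertools import groupby

# Existing "MapReduce-style" functions, kept exactly as they were:
def mapper(line):
    """Emit (word, 1) pairs for one input line."""
    return [(w, 1) for w in line.split()]

def reducer(a, b):
    """Sum the counts for one key."""
    return a + b

# Re-driving them with a functional pipeline, the way Spark would:
lines = ["spark and hadoop", "spark or hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]   # ~ lines.flatMap(mapper)
pairs.sort(key=lambda kv: kv[0])                        # ~ shuffle/group by key
counts = {k: reduce(reducer, (v for _, v in grp))       # ~ reduceByKey(reducer)
          for k, grp in groupby(pairs, key=lambda kv: kv[0])}

assert counts == {"and": 1, "hadoop": 2, "or": 1, "spark": 2}
```

The migration cost is in the driver pipeline, not in the per-record logic, which is why mapper/reducer reuse is usually the easy part.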
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
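The engine switch is a single session-level setting; a minimal HiveQL sketch (the `logs` table is hypothetical, and this assumes a Hive build with Spark support):

```sql
-- run this Hive session on Spark instead of MapReduce or Tez
set hive.execution.engine=spark;

-- subsequent queries execute as Spark jobs; the queries themselves are unchanged
SELECT page, COUNT(*) AS hits
FROM logs
GROUP BY page;
```

Because only the execution engine changes, existing HiveQL scripts, UDFs, and metastore definitions carry over as-is.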
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories (the matching open source tools appear as logos on the original slide):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra in Spark, and to store Resilient Distributed Datasets (RDDs) from Spark in Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving. Open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
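What "automatically infer the schema" does can be sketched in plain Python: a single pass over JSON lines collecting each field's observed types (Spark SQL additionally merges and reconciles conflicting types across records, which this toy skips):

```python
import json

def infer_schema(json_lines):
    """Scan every record and record each field's observed Python type names --
    a toy version of the inference pass Spark SQL's JSON reader performs
    before exposing the data as a SchemaRDD/DataFrame."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
assert infer_schema(lines) == {"name": ["str"], "age": ["int"], "city": ["str"]}
```

Note how "city" appears in only one record yet still lands in the schema; this is why no up-front DDL is needed, and why fields absent from a record simply come back as null when queried.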
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
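The core idea behind a columnar format like Parquet can be shown in a few lines of plain Python: each column is stored contiguously, so a query touching one column never has to read the others (toy records; real Parquet adds encodings, compression, and row groups on top):

```python
# Row layout: one record per dict, the shape data usually arrives in.
rows = [
    {"user": "alice", "bytes": 120, "country": "US"},
    {"user": "bob",   "bytes": 310, "country": "FR"},
    {"user": "carol", "bytes": 45,  "country": "US"},
]

# Columnar layout: one list per field, values kept in row order.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# An aggregate over "bytes" now scans a single contiguous list and can
# skip "user" and "country" entirely -- the essence of column pruning.
total = sum(columns["bytes"])
assert total == 475
assert columns["country"] == ["US", "FR", "US"]
```

Same data, same answers; the layout alone is what makes column-pruned analytical scans cheap, which is why Spark SQL pushes column selection down into the Parquet reader.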
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets.
• Data layout can change without notice.
• New data sets can be added without notice.
• Result:
• Leverage Spark to dynamically split the data.
• Leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than having to choose one of them.
(Hadoop ecosystem and Spark ecosystem logos)
65
4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
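The "Data << RAM" point can be illustrated with a toy cache in plain Python: pay the parse cost once, and every later pass over the same dataset runs from memory (the parse function is a made-up stand-in for an expensive deserialize step, loosely analogous to calling rdd.cache() before several actions):

```python
parse_calls = 0

def parse(raw):
    """Stand-in for an expensive parse/deserialize step over raw input."""
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching (MapReduce/Tez-style): each pass re-reads and re-parses.
s1 = sum(parse(raw_data))
s2 = sum(parse(raw_data))
assert parse_calls == 2 and s1 == s2 == 10

# With caching (Spark-style): parse once, then reuse the in-memory result.
parse_calls = 0
cached = parse(raw_data)
s3, s4 = sum(cached), sum(cached)
assert parse_calls == 1 and s3 == s4 == 10
```

The flip side, as the bullets above note, is that once the working set no longer fits in cluster memory the cached copy has to spill or be recomputed, and the advantage shrinks.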
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB file system (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
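The storage choice surfaces in application code only as the input URI. A minimal PySpark-flavored sketch, assuming hypothetical paths, bucket, and host names (running the commented pipeline requires a Spark installation):

```python
# Sketch: Spark's textFile() accepts a URI, so swapping storage back ends
# is a one-line change. All paths and host names below are hypothetical.
def event_log_uri(backend):
    """Return an example input URI for a given storage back end."""
    uris = {
        "local":   "file:///var/log/events.log",      # local file system
        "s3":      "s3n://example-bucket/events/",    # Amazon S3
        "tachyon": "tachyon://master:19998/events",   # in-memory FS
        "swift":   "swift://events.ostack/2015/",     # OpenStack Swift
    }
    return uris[backend]

# With a live SparkContext `sc`:
# rdd = sc.textFile(event_log_uri("s3"))
# print(rdd.count())
```

The rest of the job is identical regardless of which back end the URI points at, which is the sense in which Spark is file-system agnostic.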
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
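In code, the deployment choice is just the master URL handed to SparkConf. A minimal sketch using Spark 1.x master URL formats; the cluster host names are hypothetical:

```python
# The master URL selects the cluster manager; the application code
# itself is unchanged. Host names below are hypothetical.
MASTER_URLS = {
    "local":      "local[*]",             # all cores on one machine
    "standalone": "spark://master:7077",  # Spark standalone cluster
    "mesos":      "mesos://master:5050",  # Apache Mesos
    "yarn":       "yarn-client",          # Hadoop YARN (Spark 1.x form)
}

def make_conf_pairs(mode, app_name="demo"):
    """Key/value pairs you would feed to SparkConf().setAll(...)."""
    return [("spark.master", MASTER_URLS[mode]),
            ("spark.app.name", app_name)]

# With PySpark installed:
# from pyspark import SparkConf, SparkContext
# sc = SparkContext(conf=SparkConf().setAll(make_conf_pairs("standalone")))
```

Switching from, say, standalone to Mesos is a configuration change rather than a code change.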
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component | Hadoop ecosystem | Spark ecosystem
Storage | HDFS | Tachyon
Resource management | YARN | Mesos
Tools | Pig | Spark native API
 | Hive | Spark SQL
 | Mahout | MLlib
 | Storm | Spark Streaming
 | Giraph | GraphX
 | HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding to write frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
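As a rough illustration of the API's concision, a log filter in PySpark is a one-liner around a plain function; the commented pipeline assumes a live SparkContext `sc` and a hypothetical input path:

```python
def is_error(line):
    """Predicate usable both on local data and inside an RDD filter."""
    return "ERROR" in line

# PySpark pipeline (requires a Spark installation; path is hypothetical):
# errors = sc.textFile("hdfs:///logs").filter(is_error).count()

# The same predicate works on plain local data, e.g. in the shell:
sample = ["INFO ok", "ERROR disk full", "WARN slow"]
error_count = sum(1 for line in sample if is_error(line))
```

Because RDD operators take ordinary functions, the same logic runs in the interactive shell on a list and on a cluster-sized dataset unchanged.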
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
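The "mini-batches" row is the key difference: Spark Streaming discretizes a stream into small batches and runs an ordinary Spark job on each, which is why its latency is seconds rather than sub-second. A toy, Spark-free sketch of that idea:

```python
def mini_batches(stream, batch_size):
    """Group an event stream into fixed-size mini-batches, the way Spark
    Streaming groups records into time-sliced RDDs (here by count rather
    than by time, to keep the sketch deterministic)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Each yielded batch would then be processed as one ordinary Spark job,
# whereas Storm hands every individual record to the topology as it arrives.
```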
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | Standalone, YARN, SIMR, Mesos, …
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as their execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
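A hedged sketch of point 1 above: MapReduce-style mapper and reducer logic written as plain functions maps directly onto Spark's flatMap/reduceByKey operators. The function names and input path are illustrative, not from any real project:

```python
# MapReduce-style logic as plain functions (reusable unchanged in Spark).
def mapper(line):
    """Emit (word, 1) pairs, as a WordCount Mapper would."""
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    """Sum counts, as a WordCount Reducer would."""
    return a + b

# The Spark translation just wires them into RDD operators
# (requires a live SparkContext `sc`; path is hypothetical):
# counts = (sc.textFile("hdfs:///input")
#             .flatMap(mapper)
#             .reduceByKey(reducer)
#             .collect())
```

The shuffle-and-group step that MapReduce performs between map and reduce is what reduceByKey provides, so no job-driver boilerplate survives the translation.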
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and the open source tools that integrate with Spark (logos omitted):
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore, e.g., the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
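A minimal sketch of what the Hive round trip looks like in application code, using the Spark 1.2-era Python API; the table name and query are hypothetical, and the commented part requires PySpark built with Hive support:

```python
def top_n_query(table, n):
    """Build the HiveQL query used below (table name is hypothetical)."""
    return "SELECT key, value FROM {0} ORDER BY key LIMIT {1}".format(table, n)

# With PySpark built with Hive support (Spark 1.2 API):
# from pyspark.sql import HiveContext
# hive = HiveContext(sc)                     # reuses the Hive metastore
# rows = hive.sql(top_n_query("src", 10)).collect()
# for row in rows:
#     print(row.key, row.value)
```

The point of the integration is that the same metastore-backed tables are visible from both Hive and Spark, so no export step is needed.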
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
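A hedged sketch of the receiver-based Kafka integration (the Python `KafkaUtils.createStream` API arrived in Spark 1.3; the ZooKeeper host, consumer group, and topic below are hypothetical):

```python
def kafka_topics(topic_names, partitions_per_topic=1):
    """Build the {topic: partition-count} dict that createStream expects."""
    return {t: partitions_per_topic for t in topic_names}

# Receiver-based stream (requires PySpark and a running Kafka/ZooKeeper;
# host names, group id, and topic are hypothetical):
# from pyspark.streaming import StreamingContext
# from pyspark.streaming.kafka import KafkaUtils
# ssc = StreamingContext(sc, 2)                       # 2-second batches
# lines = KafkaUtils.createStream(ssc, "zk:2181", "demo-group",
#                                 kafka_topics(["events"]))
# lines.map(lambda kv: kv[1]).pprint()                # message values only
# ssc.start(); ssc.awaitTermination()
```

Each 2-second batch of Kafka messages becomes one RDD, so the rest of the pipeline is ordinary Spark code.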
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
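As a sketch of what "no more DDL" means in practice: given JSON-lines input, the schema is inferred from the records themselves. The sample records below are made up for illustration, and the commented part uses the Spark 1.2-era `jsonFile` API with a hypothetical path:

```python
import json

# The JSON-lines shape Spark SQL infers a schema from; records may
# have differing fields, and the inferred schema is their union.
sample_lines = [
    '{"name": "alice", "visits": 3}',
    '{"name": "bob", "visits": 7, "city": "LA"}',  # extra field is fine
]
records = [json.loads(line) for line in sample_lines]

# With a SQLContext `sqlContext` (Spark 1.2 API; path is hypothetical):
# people = sqlContext.jsonFile("hdfs:///data/people.json")
# people.printSchema()                 # schema inferred, no DDL written
# people.registerTempTable("people")
# sqlContext.sql("SELECT name FROM people WHERE visits > 5")
```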
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
(Hadoop ecosystem | Spark ecosystem)
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine, interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
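The deployment modes above differ mainly in the master URL handed to spark-submit. A sketch of the flag forms (hostnames, ports and app.py are placeholders, not real endpoints; syntax follows the Spark 1.x era of this talk):

```shell
# Placeholder hosts/ports; app.py stands in for your application
spark-submit --master local[*]            app.py   # local mode, all cores
spark-submit --master spark://host:7077   app.py   # standalone cluster
spark-submit --master mesos://host:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster        app.py   # Hadoop YARN (Spark 1.x flag)
```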
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazon's S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100% open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Components
HDFS | Tachyon
YARN | Mesos
Tools
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
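Tachyon's core idea can be conveyed with a toy in-memory store shared by two frameworks. This is a conceptual stand-in in plain Python, not Tachyon's actual interface; all names are illustrative:

```python
# Toy in-memory "file system" shared by two frameworks, to convey Tachyon's
# core idea: data kept in memory once, readable by Spark and MapReduce alike.
class InMemoryFS:
    def __init__(self):
        self.files = {}                  # path -> bytes, held in RAM
    def write(self, path, data):
        self.files[path] = data
    def read(self, path):
        return self.files[path]          # memory-speed, no disk round-trip

fs = InMemoryFS()
fs.write("/shared/events", b"e1,e2,e3")
spark_view = fs.read("/shared/events")      # a "Spark" job reads it ...
mapreduce_view = fs.read("/shared/events")  # ... and so does "MapReduce"
print(spark_view == mapreduce_view)  # True: one in-memory copy, shared
```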
bull Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
bull Mesos as data center "OS"
bull Share a datacenter between multiple cluster computing apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS ...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 lambda expressions for much more concise code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
96
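The processing-model row of the table can be sketched in plain Python (illustrative only, not the Storm or Spark Streaming APIs): record-at-a-time invokes the user function once per record for the lowest latency, while mini-batching invokes it once per batch for higher throughput:

```python
# Conceptual contrast between record-at-a-time (Storm-style) and
# mini-batch (Spark Streaming-style) processing of a stream.
def record_at_a_time(stream, fn):
    return [fn([r]) for r in stream]          # one invocation per record

def mini_batches(stream, batch_size, fn):
    return [fn(stream[i:i + batch_size])      # one invocation per batch
            for i in range(0, len(stream), batch_size)]

stream = [1, 2, 3, 4, 5, 6]
per_record = record_at_a_time(stream, sum)    # 6 calls, lowest latency
per_batch = mini_batches(stream, 3, sum)      # 2 calls, higher throughput
print(per_record, per_batch)  # [1, 2, 3, 4, 5, 6] [6, 15]
```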
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
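To show the kind of iterative graph computation GraphX targets, here is a minimal PageRank loop in plain Python (this is not GraphX code; the graph and parameters are made up for illustration):

```python
# Minimal PageRank iteration: each node spreads its rank along its out-links,
# with damping factor d. GraphX runs this kind of loop at cluster scale.
def pagerank(links, iterations=50, d=0.85):
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contribs = {n: 0.0 for n in nodes}
        for node, outs in links.items():
            for out in outs:
                contribs[out] += ranks[node] / len(outs)
        ranks = {n: (1 - d) / len(nodes) + d * c for n, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" collects the most rank
```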
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is file-system agnostic Bring your own storage
2 Deployment Spark is cluster-infrastructure agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
bull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
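The reuse point above can be sketched in plain Python, with ordinary lists standing in for RDDs (the names are illustrative, not a real Spark or Hadoop API): a classic mapper/reducer pair slots straight into a Spark-style flatMap / groupByKey / map chain:

```python
# Simulates reusing a Hadoop mapper/reducer pair in a Spark-style pipeline.
from itertools import groupby

def mapper(line):                      # classic word-count mapper
    return [(word, 1) for word in line.split()]

def reducer(key, values):              # classic word-count reducer
    return (key, sum(values))

def spark_style_word_count(lines):
    # flatMap(mapper) -> groupByKey -> map(reducer), as chained steps
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])
    return dict(reducer(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

counts = spark_style_word_count(["spark and hadoop", "spark or hadoop"])
print(counts)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```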
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
bull Run Pig with the "-x spark" option for an easy migration without development effort
bull Speed up your existing Pig scripts on Spark (Query Logical Plan Physical Plan)
bull Leverage new Spark-specific operators in Pig such as Cache
bull Still leverage many existing Pig UDF libraries
bull Pig on Spark Umbrella Jira (Status Passed end-to-end test cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059
bull Fix outstanding issues and address additional Spark functionality through the community
bull 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull New alternative to using MapReduce or Tez hive> set hive.execution.engine=spark;
bull Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
bull Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
bull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 0.11 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased) - MapReduce Spark H2O Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Hadoop ecosystem services that integrate with Spark by category (the original slide shows the open source tools for each)
bull Storage / Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector
bull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark
bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
bull YARN Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 Spark-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
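A toy model of the consumer-offset design underlying the Kafka integration (this is not the Kafka API; names and data are made up): a consumer that replays from its last committed offset after a failure sees some records twice, which is exactly the at-least-once behavior noted in the Storm vs Spark Streaming table:

```python
# Broker keeps an append-only log; the consumer tracks a committed offset
# and re-reads from it after a crash -> at-least-once delivery.
log = ["evt1", "evt2", "evt3", "evt4"]      # the partition's append-only log

committed_offset = 0
processed = []

def consume(from_offset, upto):
    global committed_offset
    for offset in range(from_offset, upto):
        processed.append(log[offset])       # process the record
    committed_offset = upto                 # commit only after processing

consume(0, 3)               # processes evt1..evt3, commits offset 3
committed_offset = 2        # pretend the last commit was lost in a crash
consume(committed_offset, len(log))         # replay from offset 2
print(processed)  # evt3 appears twice -> at-least-once semantics
```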
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
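The schema-inference idea can be sketched in plain Python: scan the records and union each field's observed types. This is only a conceptual sketch, not Spark SQL's actual algorithm:

```python
# Infer a field -> type mapping by scanning JSON records (toy version of
# the idea behind Spark SQL's automatic JSON schema inference).
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: "/".join(sorted(t)) for f, t in schema.items()}

records = ['{"name": "spark", "stars": 5}',
           '{"name": "hadoop", "tags": ["batch"]}']
print(infer_schema(records))
# {'name': 'str', 'stars': 'int', 'tags': 'list'}
```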
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
58
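Why a columnar format helps analytics can be shown with a toy example in plain Python: storing by column lets a query read only the columns it needs (Parquet additionally compresses and encodes each column; the data here is made up):

```python
# Row layout vs column layout: a column-oriented store lets an aggregate
# query touch one column instead of every field of every record.
rows = [{"user": "a", "bytes": 10, "country": "US"},
        {"user": "b", "bytes": 20, "country": "FR"},
        {"user": "c", "bytes": 30, "country": "US"}]

# Column layout: each column stored contiguously
columns = {k: [row[k] for row in rows] for k in rows[0]}

# "SELECT sum(bytes)" only needs to read one column:
print(sum(columns["bytes"]))  # 60
```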
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
bull Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services"
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or HDFS caching)
bull The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM When processing huge data volumes much bigger than cluster RAM Tez might be better since it is more "stream oriented" has a more mature shuffling implementation and closer YARN integration
bull Data << RAM Since Spark can cache parsed data in memory it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
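The Data << RAM point can be illustrated with plain-Python memoization standing in for rdd.cache() (all names here are illustrative, not Spark's API): when the working set fits in memory, the expensive computation runs once per distinct input and later uses hit the cache:

```python
# Memoization as a stand-in for Spark's in-memory RDD caching: the parse
# runs once per distinct record; repeated "actions" reuse the cached result.
compute_calls = 0

def expensive_parse(record):
    global compute_calls
    compute_calls += 1
    return record.upper()

cache = {}
def cached_parse(record):
    if record not in cache:               # first access materializes the data
        cache[record] = expensive_parse(record)
    return cache[record]                  # later accesses hit the cache

data = ["a", "b", "a", "a", "b"]
results = [cached_parse(r) for r in data]
print(results, compute_calls)  # ['A', 'B', 'A', 'A', 'B'] 2
```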
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer A smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
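The storage options above differ mostly in the URI scheme a Spark job hands to calls like sc.textFile(). As a hypothetical illustration (not from the talk; scheme strings and hosts are placeholders, and exact forms depend on connector setup and Spark version):

```python
# Hypothetical helper: build the URI a Spark job would pass to sc.textFile().
# The scheme strings are illustrative for the Spark 1.x era.
SCHEMES = {
    "hdfs": "hdfs://{host}/{path}",        # classic Hadoop DFS
    "s3": "s3n://{host}/{path}",           # Amazon S3 (s3n scheme)
    "tachyon": "tachyon://{host}/{path}",  # in-memory file system
    "swift": "swift://{host}/{path}",      # OpenStack Swift object store
    "local": "file:///{path}",             # plain local file system
}

def storage_uri(backend, host="", path=""):
    """Compose a storage URI for the chosen backend."""
    return SCHEMES[backend].format(host=host, path=path)

print(storage_uri("s3", host="my-bucket", path="logs/2015/03/09"))
# -> s3n://my-bucket/logs/2015/03/09
```

The point of the slide is exactly this: the processing code stays the same while the storage backend varies.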
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
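In practice, the deployment choices above mostly surface as different "master" settings passed to spark-submit or SparkConf. A minimal sketch (not from the talk; host names and ports are placeholders, and the master-URL syntax shown is the Spark 1.x style):

```python
# Illustrative master URLs per deployment mode (Spark 1.x era syntax).
MASTERS = {
    "local": "local[*]",                       # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos": "mesos://mesos-host:5050",        # Apache Mesos
    "yarn": "yarn-cluster",                    # Hadoop YARN
}

def submit_command(mode, app="my_app.py"):
    """Compose an illustrative spark-submit command line for a mode."""
    return "spark-submit --master {m} {app}".format(m=MASTERS[mode], app=app)

print(submit_command("mesos"))
# -> spark-submit --master mesos://mesos-host:5050 my_app.py
```

The application code is unchanged across modes; only the master setting (and cluster-side configuration) differs.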
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
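The conciseness the slide refers to is the lambda-based functional chain. A minimal sketch (not from the talk): the PySpark form is shown in a comment, and plain-Python equivalents run the same chain locally so no cluster is needed:

```python
# The equivalent PySpark pipeline would be roughly:
#   sc.parallelize(nums).filter(lambda n: n % 2 == 0) \
#                       .map(lambda n: n * n) \
#                       .reduce(lambda a, b: a + b)
# Plain Python stand-ins for the same lambda-based chain:
from functools import reduce

nums = range(1, 11)
evens_squared_sum = reduce(
    lambda a, b: a + b,                       # reduce(...)
    map(lambda n: n * n,                      # map(...)
        filter(lambda n: n % 2 == 0, nums)))  # filter(...)

print(evens_squared_sum)  # 4 + 16 + 36 + 64 + 100 = 220
```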
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
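The "record at a time" vs "mini batches" distinction in the table can be sketched in plain Python (an illustration, not from the talk; real Spark Streaming groups records by time interval rather than by count):

```python
# Storm-style: each record is handed to processing as soon as it arrives.
def record_at_a_time(stream, handle):
    for record in stream:
        handle(record)

# Spark-Streaming-style: records are grouped into small batches first;
# each batch is then processed with the normal (batch) Spark API.
def mini_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

print(list(mini_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what buys Spark Streaming its exactly-once semantics and batch-API reuse, at the cost of the few seconds of latency shown in the table.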
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
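Reusing mapper and reducer functions can be sketched as follows (an illustration, not from the talk): the same word-count mapper/reducer pair drives both styles; the Spark wiring is shown in a comment, and a plain-Python simulation of flatMap/shuffle/reduceByKey runs locally:

```python
from collections import defaultdict

def mapper(line):
    # MapReduce map(): emit a (word, 1) pair for each word in the line
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    # MapReduce reduce(): combine two counts for the same key
    return a + b

# In Spark the very same functions would be wired up roughly as:
#   counts = sc.textFile("hdfs://...").flatMap(mapper).reduceByKey(reducer)
# Below, flatMap / shuffle / reduceByKey are simulated locally:
def word_count(lines):
    pairs = [kv for line in lines for kv in mapper(line)]  # flatMap(mapper)
    grouped = defaultdict(list)
    for key, value in pairs:                               # shuffle by key
        grouped[key].append(value)
    result = {}
    for key, values in grouped.items():                    # reduceByKey(reducer)
        acc = values[0]
        for v in values[1:]:
            acc = reducer(acc, v)
        result[key] = acc
    return result

print(word_count(["spark or hadoop", "spark and hadoop"]))
```

The mapper/reducer pair is untouched; only the surrounding driver code changes between the two frameworks.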
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (Currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service | Open Source Tool (tools were shown as logos in the original slide)
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
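What "infer the schema of a JSON dataset" means can be shown with a tiny plain-Python sketch (an illustration, not Spark SQL's actual algorithm, which works at scale over distributed files): scan JSON lines and union the fields and value types seen.

```python
import json

def infer_schema(json_lines):
    """Map each field name to the set of value-type names observed for it,
    mimicking (very loosely) Spark SQL's schema inference over JSON lines."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]
print(infer_schema(lines))
```

Records need not share the same fields; the inferred schema is the union, which is why no up-front DDL is required.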
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
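The "columnar" idea behind Parquet can be sketched with in-memory Python structures (an illustration, not the Parquet file format itself):

```python
# Row-oriented layout: one record per entry, as a row store keeps them.
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 29},
]

def to_columnar(rows):
    """Pivot a list of records into one list per column, as a columnar
    format stores them; scanning a single column then touches less data
    and compresses better (similar values sit next to each other)."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)
    return columns

columnar = to_columnar(rows)
print(columnar["age"])  # [34, 29] -- read without touching the "name" values
```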
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
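The rule of thumb above can be written down as code. This is a hypothetical helper, not from the talk; the threshold is invented for illustration, and real engine choice depends on workload shape, not just the data-to-RAM ratio:

```python
def suggest_engine(data_gb, cluster_ram_gb):
    """Toy heuristic: pick a stream-oriented engine when data greatly
    exceeds cluster memory, otherwise favor in-memory caching."""
    ratio = data_gb / float(cluster_ram_gb)
    if ratio > 1.0:
        return "tez"    # Data >> RAM: stream-oriented processing
    return "spark"      # Data fits in memory: cache and iterate

print(suggest_engine(500, 128))  # 'tez'
print(suggest_engine(50, 128))   # 'spark'
```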
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
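The deployment choices above map directly onto the master URL passed to Spark. As a hedged illustration, here is a tiny plain-Python classifier of the documented (2015-era) master-URL forms; the helper function itself is invented for this deck, not part of any Spark API:

```python
def deployment_mode(master):
    # Mirror how Spark's --master URL selects the cluster manager.
    if master.startswith("local"):
        return "local"          # local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"     # Spark's own standalone cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "yarn"
    raise ValueError("unrecognized master URL: " + master)

print(deployment_mode("local[*]"))           # local
print(deployment_mode("mesos://host:5050"))  # mesos
```

Because only the master URL changes, the same application jar can move from a laptop to a standalone, Mesos, or YARN cluster without code changes.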
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
Hadoop ecosystem tool → Spark ecosystem alternative:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups (cgroups)
• Requests: Specific requests and locality preference | More generic, but more coding to write frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
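The conciseness claim above is easy to demonstrate. Below is a plain-Python sketch of the classic RDD-style word count; the names are ordinary Python (Spark's `flatMap`/`map`/`reduceByKey` chain is emulated with a comprehension and a `Counter`), so no cluster is assumed:

```python
from collections import Counter

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap: each line becomes a sequence of words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: (word, 1) pairs folded into per-word counts
counts = Counter(words)

print(counts["to"])  # 4
```

The same pipeline in Spark's Scala or Python API is a one-liner of chained transformations; the functional shape, not the cluster, is what makes the code short.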
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
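The "mix and match SQL and imperative APIs" point can be made concrete without Spark. As a stand-in, this sketch uses Python's built-in sqlite3 in place of Spark SQL (the table and rows are made up for illustration): the aggregation is declarative SQL, the post-processing is ordinary imperative code over the result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative part: aggregate in SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative part: arbitrary program logic over the query result
heavy_users = [user for user, total in rows if total > 4]
print(heavy_users)  # ['ann', 'bob']
```

In Spark SQL the result of a query is an RDD (a DataFrame from 1.3 on), so exactly this hand-off between a SQL step and a programmatic step happens inside one job.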
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
• Criteria: Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
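The "mini batches" model in the comparison above can be sketched in a few lines of plain Python: instead of handling each record as it arrives (the Storm model), timestamped records are grouped into fixed-width windows and each batch is processed as a unit. The data and batch width here are illustrative only:

```python
def micro_batches(records, batch_seconds):
    # Assign each (timestamp, value) record to a fixed-width time window,
    # the way Spark Streaming discretizes a stream into RDD mini-batches.
    batches = {}
    for ts, value in records:
        key = int(ts // batch_seconds)
        batches.setdefault(key, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

Batching is what buys Spark Streaming its exactly-once semantics and reuse of the core batch API, at the cost of a few seconds of latency per window.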
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
33
2. Transition
The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services with Spark integration (the open source tools appear as logos on the original slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without using the Hadoop API, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
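The posts above compute PageRank with Spark's GraphX over Neo4j data. To show the algorithm itself, here is a minimal plain-Python power-iteration PageRank over a toy graph; the graph, damping factor, and iteration count are illustrative, not taken from the articles:

```python
def pagerank(links, damping=0.85, iters=50):
    # links: node -> list of out-neighbors; every node must have out-links.
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node keeps a baseline share, plus damped contributions
        # from the ranks of the nodes linking to it.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

links = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(round(sum(ranks.values()), 6))  # 1.0 (ranks form a distribution)
```

GraphX distributes exactly this iteration across a cluster, with the rank vector and edge list partitioned as RDDs.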
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit nod to Mesos, the earlier resource negotiator).
• Integration is still improving; see the open Spark JIRA issues (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC).
• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
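The schema-inference idea above is easy to sketch without Spark: scan the JSON records, record each field's observed type, and widen to a union when types disagree. This stand-alone Python illustration captures the core mechanism only; Spark SQL's real inference additionally merges nested structures and numeric types:

```python
import json

def infer_schema(json_lines):
    # field name -> set of Python type names observed across all records
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(ts) for f, ts in schema.items()}

data = ['{"name": "ann", "age": 34}',
        '{"name": "bob", "age": 36, "tags": ["spark"]}']
print(infer_schema(data))
# {'name': ['str'], 'age': ['int'], 'tags': ['list']}
```

Because the schema falls out of the data, no DDL is needed before querying, which is exactly the convenience the slide describes.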
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: YARN + Mesos references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
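The "Data << RAM" point above boils down to caching: if the parsed data fits in memory, every pass after the first one skips the parse. A plain-Python stand-in for `RDD.cache()` (the parse counter is just instrumentation for this illustration):

```python
parse_calls = {"n": 0}

def parse(raw):
    # Simulated expensive parse step; the counter records how often it runs.
    parse_calls["n"] += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching: every pass re-parses the input (2 parses here)
total_a = sum(parse(raw_data)) + max(parse(raw_data))

# With caching: parse once, reuse the in-memory result for every pass
cached = parse(raw_data)
total_b = sum(cached) + max(cached)

print(parse_calls["n"], total_a == total_b)  # 3 True
```

Spark applies the same trade-off cluster-wide: once an RDD is cached, iterative algorithms (the MLlib workloads, for example) pay the I/O and parsing cost only on the first pass.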
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
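Whichever of these deployments is chosen, the cluster manager is selected solely by the master URL handed to Spark; an illustrative sketch of `spark-submit` command lines (host names and `app.py` are placeholders, not from the deck):

```shell
# Illustrative only: the --master URL alone selects the deployment mode
spark-submit --master local[4] app.py             # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.py    # standalone cluster
spark-submit --master mesos://host:5050 app.py    # Apache Mesos
```

The application code itself does not change between these modes; only the master URL (and any packaging details) differs.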
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component | Hadoop ecosystem | Spark ecosystem
--------- | ---------------- | ---------------
Storage   | HDFS             | Tachyon
Resource management | YARN   | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria         | YARN                                      | Mesos
---------------- | ----------------------------------------- | -----
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
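The flavor of the native API can be sketched with a toy word count in plain Python - the functions below are hypothetical stand-ins for Spark's flatMap/map/reduceByKey, not actual Spark code:

```python
# Toy word count mimicking the shape of Spark's RDD API in plain Python.
# Illustrative only; a real Spark job would start from sc.textFile(...).
lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split()]   # ~ flatMap
pairs = [(w, 1) for w in words]                       # ~ map
counts = {}                                           # ~ reduceByKey
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # 2
```

In the real API the same pipeline is a short chain of transformations, in any of the three supported languages.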
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria          | Storm                    | Spark Streaming
----------------- | ------------------------ | ---------------
Processing model  | Record at a time         | Mini batches
Latency           | Sub-second               | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available  | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
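The "record at a time" vs "mini batches" row of the table can be sketched in plain Python - a toy illustration, not Storm or Spark Streaming code:

```python
# Record-at-a-time (Storm-style) vs mini-batch (Spark Streaming-style),
# sketched with plain Python functions. Illustrative only.
def record_at_a_time(stream, handle):
    for record in stream:        # each record is handled the moment it arrives
        handle(record)

def mini_batches(stream, handle_batch, batch_size=3):
    batch = []
    for record in stream:        # records are buffered into small batches
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)  # one batch = one small Spark job
            batch = []
    if batch:                    # flush the final partial batch
        handle_batch(batch)

seen = []
mini_batches(range(7), seen.append, batch_size=3)
print(seen)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Buffering into batches is what trades a few seconds of latency for the throughput and batch-API integration in the table above.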
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
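The "-x spark" option above is the entire migration step for many scripts; an illustrative command line (assumes a Spork-enabled Pig build, and `wordcount.pig` is a hypothetical script):

```shell
# Run an unchanged Pig script on the Spark engine instead of MapReduce
pig -x spark wordcount.pig
```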
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
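The engine switch quoted above is a per-session Hive setting; an illustrative session (assumes a Hive build with the Spark engine available; the `logs` table is hypothetical):

```sql
-- From the Hive shell; the default engine would otherwise be mr (or tez)
set hive.execution.engine=spark;
SELECT level, COUNT(*) FROM logs GROUP BY level;  -- now executes as a Spark job
```

The query text itself is unchanged; only the execution engine underneath differs.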
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Open source tools and services that integrate with Spark, by layer:
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of the open issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
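The schema-inference idea can be sketched in plain Python; this is a toy illustration only, not Spark SQL code (in Spark SQL the work is done for you when loading the JSON dataset):

```python
import json

# Toy sketch of inferring a schema from a JSON dataset (plain Python, not Spark SQL)
records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "age": 28}']
rows = [json.loads(r) for r in records]

# Infer a field -> type mapping by scanning the parsed rows
schema = {}
for row in rows:
    for field, value in row.items():
        schema[field] = type(value).__name__

print(schema)  # {'name': 'str', 'age': 'int'}
```

Spark SQL does this scan distributed over the whole dataset, then exposes the result as queryable columns.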
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop Ecosystem | Spark Ecosystem
Storage          | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
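One concrete way Spark uses Tachyon without code changes is the OFF_HEAP storage level, which in Spark 1.x keeps cached blocks in Tachyon rather than in the executor JVM heap. A hedged sketch; the master URL is hypothetical, and the Spark calls are commented out because they require Spark 1.x plus a running Tachyon cluster.

```python
# Configuration key as in Spark 1.2; the URL is hypothetical.
tachyon_conf = {"spark.tachyonStore.url": "tachyon://master:19998"}

# from pyspark import SparkConf, SparkContext, StorageLevel
# conf = SparkConf().setAppName("tachyon-demo")
# for key, value in tachyon_conf.items():
#     conf.set(key, value)
# sc = SparkContext(conf=conf)
# logs = sc.textFile("tachyon://master:19998/logs")
# logs.persist(StorageLevel.OFF_HEAP)   # cached blocks live in Tachyon
# print(logs.count())
```

Because the cached blocks live outside the application's JVM, several Spark jobs can share the same in-memory data.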
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, YARN, HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
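As an illustration of how similar the native API feels across languages, here is the classic word count in Python. The helper functions are plain Python and testable without a cluster; the commented lines show where they plug into the core API (the input path is hypothetical).

```python
def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()

def add(a, b):
    """Associative combiner passed to reduceByKey."""
    return a + b

# from pyspark import SparkContext      # requires a Spark installation
# sc = SparkContext("local[*]", "wordcount")
# counts = (sc.textFile("hdfs:///input/docs")   # hypothetical path
#             .flatMap(tokenize)
#             .map(lambda w: (w, 1))
#             .reduceByKey(add))
# counts.saveAsTextFile("hdfs:///output/counts")
```

The Scala version has line-for-line the same shape, and with Java 8 lambdas the Java version finally does too.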
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
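The first row of the table is the key difference, and it is easy to picture with plain Python: Storm hands each record to the operator as it arrives, while Spark Streaming cuts the stream into mini batches and runs a small batch job on each one. This is a toy simulation of the two models, not an API of either system.

```python
def record_at_a_time(stream, op):
    """Storm-style: apply op to every record individually (lowest latency)."""
    return [op(record) for record in stream]

def mini_batches(stream, batch_size):
    """Spark-Streaming-style: group the stream into small batches."""
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

events = [1, 2, 3, 4, 5]
print(record_at_a_time(events, lambda r: r * 10))   # [10, 20, 30, 40, 50]
print(mini_batches(events, batch_size=2))           # [[1, 2], [3, 4], [5]]
```

Mini-batching is what lets Spark Streaming reuse the core batch API (and get exactly-once semantics), at the price of a few seconds of latency.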
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
[Slide: a table of service categories and the open source tools (shown as logos) that integrate with Spark in each one]
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
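A hedged sketch of the first option (Hadoop InputFormats via newAPIHadoopRDD), loosely following the bundled HBaseTest example but in Python. The table and ZooKeeper host names are hypothetical, and the Spark calls are commented out because they need a Spark 1.x installation with the HBase jars on the classpath.

```python
# Minimal read configuration; host and table names are hypothetical.
hbase_conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapreduce.inputtable": "events",
}

# from pyspark import SparkContext
# sc = SparkContext("local[*]", "hbase-read")
# rdd = sc.newAPIHadoopRDD(
#     "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
#     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
#     "org.apache.hadoop.hbase.client.Result",
#     conf=hbase_conf)
# print(rdd.count())
```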
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
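A hedged sketch of the bullets above in the Spark 1.2-era Python API. The table and column names are hypothetical, and the Spark calls are commented out because they require a Spark build with Hive support and a Hive metastore.

```python
# Plain query string, testable on its own; handed to Spark SQL below.
TOP_PAGES_QUERY = "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page"

# from pyspark import SparkContext
# from pyspark.sql import HiveContext
# sc = SparkContext("local[*]", "hive-demo")
# hc = HiveContext(sc)
# top_pages = hc.sql(TOP_PAGES_QUERY)          # read from a Hive table
# top_pages.saveAsTable("top_pages_snapshot")  # write the result back to Hive
```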
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
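A hedged sketch of the native integration in the Python API (Kafka support in the Python API arrived around Spark 1.3; topic, group and host names are hypothetical). The parser is plain Python and testable without a cluster; the streaming wiring is commented out.

```python
def parse_event(message):
    """Kafka delivers (key, value) pairs; split the CSV value into fields."""
    _key, value = message
    return value.split(",")

# from pyspark import SparkContext
# from pyspark.streaming import StreamingContext
# from pyspark.streaming.kafka import KafkaUtils
# sc = SparkContext("local[2]", "kafka-demo")   # >= 2 threads: receiver + work
# ssc = StreamingContext(sc, 5)                 # 5-second mini batches
# stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group",
#                                  {"events": 1})
# stream.map(parse_event).pprint()
# ssc.start(); ssc.awaitTermination()
```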
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
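A sketch of what "no more DDL" means in practice: one JSON document per line, schema inferred on load. The sample data is built with the stdlib and is testable as-is; the Spark SQL calls (Spark 1.2-era API) are commented out because they need a Spark installation.

```python
import json

records = [{"user": "ada", "visits": 3}, {"user": "alan", "visits": 5}]
lines = [json.dumps(r) for r in records]   # one JSON object per line

# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext("local[*]", "json-demo")
# sqlContext = SQLContext(sc)
# people = sqlContext.jsonRDD(sc.parallelize(lines))   # schema inferred here
# people.registerTempTable("people")
# print(sqlContext.sql("SELECT user FROM people WHERE visits > 4").collect())
```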
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
  Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your "Big Data" use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
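The chained-transformation style of the native API can be sketched without Spark at all. Below is a toy, stdlib-only Python illustration of the classic word-count chain; the flatMap/map/reduceByKey names in the comments refer to the real RDD operations, but the code itself is plain Python, not PySpark:

```python
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap(lambda line: line.split()) -- one flat list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map(lambda w: (w, 1)) -- pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(lambda a, b: a + b) -- sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In actual Spark the same pipeline is a single chained expression over an RDD, and the interactive Scala/Python shells let you build it up step by step.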
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
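The first row of the table, record-at-a-time versus mini batches, is the core difference between the two models. Here is a toy sketch in plain Python (neither Storm's nor Spark Streaming's actual API) of how each model would group the same stream of records:

```python
# Toy illustration only -- not Storm or Spark Streaming code.
events = list(range(10))  # a stream of 10 incoming records

# Storm-style: handle each record individually as it arrives
processed_one_at_a_time = [e * 2 for e in events]

# Spark Streaming-style: collect records into mini-batches
# (one batch per "batch interval"), then process each batch as a unit
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
processed_in_batches = [[e * 2 for e in batch] for batch in batches]

print(len(batches))             # 3
print(processed_in_batches[0])  # [0, 2, 4, 6]
```

Batching is what lets Spark Streaming reuse the core Spark batch API and give exactly-once semantics per batch, at the cost of the few-seconds latency shown in the table.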
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
36
Hive on Spark (Currently in Beta, expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
[Matrix of Hadoop ecosystem services and the open source tools Spark integrates with:]
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
  • GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
  • httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
  • httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
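The schema-inference idea can be sketched with nothing but the standard library. A rough illustration (not Spark's actual inference code): take the union of fields seen across JSON records, and note the type observed for each field:

```python
import json

# Two records with partially overlapping fields, as JSON text lines
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

# Union of all fields with the value types observed -- roughly what
# Spark SQL does before exposing the data as a SchemaRDD/DataFrame.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print(sorted(schema))  # ['age', 'city', 'name']
```

Real Spark SQL additionally merges conflicting types, handles nesting and arrays, and maps the result onto its own SQL type system.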
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
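The point of a columnar format is that a query touching one column never has to read the others. A minimal, format-agnostic sketch in plain Python (not the Parquet encoding itself):

```python
# Row-oriented layout: reading one field still walks every full row
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
]

# Column-oriented layout: each column is stored contiguously,
# so scanning 'score' never touches 'id' or 'name'
columns = {
    "id": [1, 2],
    "name": ["alice", "bob"],
    "score": [90, 75],
}

avg_row = sum(r["score"] for r in rows) / len(rows)      # touches whole rows
avg_col = sum(columns["score"]) / len(columns["score"])  # touches one column
print(avg_row == avg_col)  # True
```

Parquet adds per-column compression and encoding on top of this layout, which is why analytical queries over a few columns of a wide table are so much cheaper than with row-oriented files.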
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  • Problem:
    - Various inbound data sets
    - Data layout can change without notice
    - New data sets can be added without notice
  • Result:
    - Leverage Spark to dynamically split the data
    - Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
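Fine-grained sharing is easier to picture with a toy model. The sketch below is plain Python and purely illustrative (it uses none of the real Mesos APIs, and the framework names and task lengths are made up): every idle CPU is re-offered on each tick, so a job with many short tasks soaks up capacity that a long-running job is not using.

```python
# Toy sketch of fine-grained resource sharing (NOT the Mesos API):
# a small cluster re-offers idle CPUs every tick, so a Spark-like job
# with short tasks can use capacity another framework leaves idle.
from collections import deque

def run_cluster(cpus, frameworks):
    """frameworks: dict name -> deque of task durations (in ticks)."""
    running = {}   # cpu_id -> (framework, ticks_left)
    log = []       # (tick, cpu, framework) for every task launch
    tick = 0
    while any(frameworks.values()) or running:
        # Finish work on busy CPUs.
        for cpu, (fw, left) in list(running.items()):
            if left == 1:
                del running[cpu]
            else:
                running[cpu] = (fw, left - 1)
        # Offer every idle CPU (fine-grained: offers are per-CPU and
        # re-made every tick, not reserved per job for its lifetime).
        for cpu in range(cpus):
            if cpu not in running:
                for fw in sorted(frameworks, key=lambda f: -len(frameworks[f])):
                    if frameworks[fw]:
                        running[cpu] = (fw, frameworks[fw].popleft())
                        log.append((tick, cpu, fw))
                        break
        tick += 1
    return log

log = run_cluster(
    4,
    {"mapreduce": deque([3, 3]),   # two long tasks
     "spark": deque([1] * 6)},     # many short tasks
)
print(log)
```

In real Mesos the same effect comes from resource offers: the master offers freed resources to registered frameworks, and Spark's fine-grained mode accepts them per task.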
90
YARN vs Mesos
Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and memory
Running tasks     Unix processes                    Linux container groups
Requests          Specific requests and             More generic, but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
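The conciseness argument is easiest to see on the canonical word count. As a sketch, here is the same flatMap → map → reduceByKey dataflow in plain Python (no Spark installation needed to run it); the pyspark equivalent is shown in the comment.

```python
# Word count expressed as the Spark-style flatMap / map / reduceByKey
# pipeline, but in plain Python so it runs without a cluster.
# In pyspark, the same dataflow is roughly:
#   sc.textFile("hdfs://...").flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from itertools import chain
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]

# flatMap: one line -> many words
words = chain.from_iterable(line.split() for line in lines)
# map: word -> (word, 1)
pairs = ((w, 1) for w in words)
# reduceByKey: sum the 1s per word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```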
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
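To make "mix and match SQL with imperative code" concrete without a cluster, here is the pattern in miniature, with the stdlib SQLite module standing in for the SQL engine (an illustration of the idea only, not Spark code; the table and column names are invented):

```python
# The Spark SQL idea in miniature: declarative SQL for the relational
# part, ordinary code for the ad-hoc part. SQLite stands in for the
# SQL engine here; in Spark the table would be a Hive table, JSON or
# Parquet file, and the result an RDD/DataFrame.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("ann", 120), ("bob", 300), ("ann", 80), ("cid", 50)])

# Step 1: SQL for the relational aggregation.
rows = db.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user").fetchall()

# Step 2: imperative post-processing that would be awkward in pure SQL.
flagged = [user for user, total in rows if total > 100]

print(sorted(flagged))   # users over quota
```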
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once              Exactly once
record processed)            (may be duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
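The first row of the table ("record at a time" vs "mini batches") is the crux. A plain-Python sketch of the mini-batch model (illustrative only; the event data is made up): incoming events are grouped into fixed-width time windows, and each window is processed as one small batch, which is why latency is a few seconds rather than sub-second.

```python
# Mini-batch (Spark Streaming style) vs record-at-a-time (Storm style),
# sketched in plain Python. Events carry a timestamp in seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (1.8, "c"), (2.4, "a")]

def micro_batches(events, width=1.0):
    """Group events into consecutive fixed-width time windows."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // width), []).append(value)
    return [batch for _, batch in sorted(batches.items())]

# Spark Streaming style: latency ~ batch width; each window becomes
# one small batch job over a tiny RDD.
print(micro_batches(events))

# Storm style: each record is handed to the topology as it arrives.
processed = [value for _, value in events]
```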
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout news, 25 April 2014: 'Goodbye MapReduce.' Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories with open source tools that integrate with Spark (tool logos shown on the original slide): storage/serving layer, data formats, data ingestion services, resource management, search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: a Flume-style push-based approach
  • Approach 2 (experimental): a pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
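What "automatically infer the schema" amounts to can be sketched in a few lines of plain Python. This toy version (field names invented; Spark's real inference is richer, handling nested structs and arrays) scans the records, unions their fields, and widens types when records disagree:

```python
# Toy version of Spark SQL's JSON schema inference: scan the records,
# take the union of all fields, and widen types when records disagree.
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
    '{"name": "cid", "age": 28.5}',
]

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            seen = schema.get(field)
            if seen is None or seen == t:
                schema[field] = t
            elif {seen, t} == {"int", "float"}:
                schema[field] = "float"   # widen int -> float
            else:
                schema[field] = "str"     # fall back to string
    return schema

print(infer_schema(records))
```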
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
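The benefit of a columnar format is easy to show in miniature. The plain-Python sketch below (illustrative data; real Parquet adds row groups, encodings and compression) contrasts a row layout with a column layout: an aggregate over one field touches a single column instead of every record.

```python
# Row layout vs columnar layout, in miniature. Parquet stores columns
# contiguously, so "SELECT SUM(bytes)" touches one column, not whole rows.
rows = [
    {"user": "ann", "url": "/a", "bytes": 120},
    {"user": "bob", "url": "/b", "bytes": 300},
    {"user": "ann", "url": "/c", "bytes": 80},
]

# Row-oriented: one record after another (what a CSV or Avro file is like).
row_store = rows

# Column-oriented: one list per field (what Parquet is like, pre-encoding).
col_store = {field: [r[field] for r in rows] for field in rows[0]}

# A scan of a single column reads len(rows) values instead of
# len(rows) * num_fields values:
total = sum(col_store["bytes"])
print(total)
```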
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: 'CrunchIndexerTool on Spark'.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
(Hadoop ecosystem and Spark ecosystem component logos shown on the slide)
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more 'stream oriented', has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the 'Right' Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
38
(Expected in 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 0.11 ships with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service                    Open Source Tool
Storage
Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL
44
3 Integration bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration bull Out of the box Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTestscala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark & Cassandra Integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra httptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector bull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental) bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
database bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN: Yet Another Resource Negotiator, an implicit reference to Mesos as the resource negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables bull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0 bull Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration bull Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
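What makes Kafka a natural source for Spark Streaming is its log abstraction: consumers track their own offsets into an append-only log, so a failed job can rewind and replay. A toy plain-Python sketch of that idea (not the Kafka API; class and method names are invented for illustration):

```python
# Illustrative sketch of Kafka-style consumption: an append-only log plus
# consumer-managed offsets, which is what lets a streaming job replay
# records after a failure. Not Kafka code; names are made up.

class Log:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

log = Log()
for word in ["spark", "streaming", "kafka"]:
    log.append(word)

# A consumer remembers how far it has read ...
offset = 0
batch = log.read(offset)
offset += len(batch)                        # commit the new position

# ... and can rewind to reprocess everything after a failure.
replayed = log.read(0)
assert batch == replayed == ["spark", "streaming", "kafka"]
```

The broker keeps no per-consumer state; the consumer's committed offset is the only cursor, which is why replay is cheap.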
55
3 Integration bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume. There are two approaches to this bull Approach 1 Flume-style Push-based Approach bull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
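The push-vs-pull distinction can be pictured without Flume or Spark at all. In the push model the source writes straight into the receiver's buffer; in the pull model events land in an intermediate sink that the consumer drains at its own pace. A hypothetical plain-Python sketch (not the Flume API):

```python
from collections import deque

# Approach 1 (push): the source delivers events straight to the receiver.
def push(source_events, receiver_buffer):
    for e in source_events:
        receiver_buffer.append(e)

# Approach 2 (pull): events accumulate in an intermediate sink; the
# consumer drains it when ready, which tolerates a slow or restarting
# consumer -- the robustness argument for the custom-sink approach.
def pull(sink, n):
    return [sink.popleft() for _ in range(min(n, len(sink)))]

receiver = deque()
push(["e1", "e2"], receiver)
assert list(receiver) == ["e1", "e2"]

sink = deque(["e3", "e4", "e5"])
assert pull(sink, 2) == ["e3", "e4"]   # consumer takes two events ...
assert list(sink) == ["e5"]            # ... the rest wait in the sink
```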
56
3 Integration bull Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
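What "automatically infer the schema" means can be sketched in a few lines of plain Python — a toy version of the kind of inference Spark SQL performs, not its actual code: scan the records, take the union of the field names, and note a type per field.

```python
import json

# Toy schema inference in the spirit of Spark SQL's JSON support:
# union all fields seen across records and record each field's type.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "NYC"}',
]
schema = infer_schema(lines)
# Fields missing from some records still appear in the unified schema.
assert schema == {"name": "str", "age": "int", "city": "str"}
```

The real engine additionally widens conflicting types and handles nesting, but the principle — schema derived from data, no DDL — is the same.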
57
3 Integration bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem regardless of the choice of data processing framework, data model or programming language httpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to bull Import relational data from Parquet files bull Run SQL queries over imported data bull Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
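Why a columnar format pairs well with analytical SQL can be shown with a toy example — plain Python mimicking the layout idea, not the Parquet format itself: storing by column lets a query touch only the columns it needs.

```python
# Toy row-store vs column-store comparison, illustrating why columnar
# formats such as Parquet suit analytical queries: scanning one column
# never touches the other columns' data.
rows = [("alice", 34, "NYC"), ("bob", 41, "LA"), ("carol", 29, "NYC")]

# Column layout: one array per column, as a columnar file would store it.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# "SELECT avg(age)" reads a single contiguous column ...
avg_age = sum(columns["age"]) / len(columns["age"])
assert round(avg_age, 2) == 34.67

# ... whereas the row layout drags every field of every row past the CPU.
avg_age_rowwise = sum(age for _, age, _ in rows) / len(rows)
assert avg_age == avg_age_rowwise
```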
58
3 Integration bull Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
bull Avro/Spark Use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
bull Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos bull 'Myriad' Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data >> RAM When processing huge data volumes much bigger than cluster RAM, Tez might be better since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration
bull Data << RAM Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
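The Data << RAM point is really an argument about avoiding recomputation. A toy plain-Python illustration (no Spark involved) that counts how often an expensive "parse" step runs with and without a cache, mimicking the effect of `rdd.cache()`:

```python
# Toy illustration of why caching pays off when data fits in memory:
# count how many times the expensive "parse" runs with and without a cache.
parse_calls = 0

def parse(raw):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching, every action re-parses the input
# (like an uncached RDD recomputed per action).
total = sum(parse(raw_data))
maximum = max(parse(raw_data))
assert (total, maximum, parse_calls) == (10, 4, 2)

# With caching, the parsed result is computed once and reused --
# the effect of rdd.cache() when the data fits in cluster memory.
cached = parse(raw_data)          # third and final parse
assert (sum(cached), max(cached)) == (10, 4)
assert parse_calls == 3
```

When the data does not fit, the cache is spilled or recomputed anyway, which is the Data >> RAM caveat above.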
70
4 Complementarity bull Emergence of the 'Smart Execution Engine' Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity
3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark
bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FS
bull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
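The "bring your own storage" point boils down to Spark addressing datasets by URI and delegating I/O to whichever connector handles the URI's scheme. A hypothetical plain-Python sketch of that dispatch idea (the connector table is invented for illustration; it is not Spark code):

```python
from urllib.parse import urlparse

# Toy dispatch table standing in for Spark's pluggable storage layer:
# the path's URI scheme, not Spark itself, decides which storage is used.
CONNECTORS = {
    "hdfs":    "Hadoop Distributed File System",
    "s3n":     "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file":    "local file system",
}

def resolve_storage(path):
    scheme = urlparse(path).scheme or "file"   # bare paths -> local files
    return CONNECTORS[scheme]

assert resolve_storage("hdfs://namenode:8020/logs") == "Hadoop Distributed File System"
assert resolve_storage("s3n://bucket/logs") == "Amazon S3"
assert resolve_storage("tachyon://master:19998/logs") == "Tachyon in-memory file system"
assert resolve_storage("/tmp/logs") == "local file system"
```

Swapping storage then means changing a path string, not the processing code — which is why the alternatives listed above are drop-in choices.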
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull ...
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) - httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
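All of these choices surface in one place at launch time: the master URL handed to Spark. The common Spark 1.x forms can be listed as plain data — host names and ports below are placeholders, not real endpoints:

```python
# Master URL forms accepted by Spark 1.x for the main deployment modes.
# Host names here are placeholders; 7077 and 5050 are the usual default
# ports for the standalone master and the Mesos master respectively.
def master_url(mode, host="cluster-host"):
    urls = {
        "local":      "local[*]",                 # all cores on one machine
        "standalone": "spark://%s:7077" % host,   # Spark's own cluster manager
        "mesos":      "mesos://%s:5050" % host,   # Apache Mesos
        "yarn":       "yarn-cluster",             # Hadoop YARN, cluster mode
    }
    return urls[mode]

assert master_url("local") == "local[*]"
assert master_url("standalone", "master1") == "spark://master1:7077"
assert master_url("mesos", "master1") == "mesos://master1:5050"
assert master_url("yarn") == "yarn-cluster"
```

The application code stays the same across all four; only the URL passed to spark-submit (or the SparkConf) changes.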
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions bull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE bull DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark & Cassandra Piotr Kolaczkowski September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector Helena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
                 Hadoop Ecosystem    Spark Ecosystem
Component        HDFS                Tachyon
                 YARN                Mesos
Tools            Pig                 Spark native API
                 Hive                Spark SQL
                 Mahout              MLlib
                 Storm               Spark Streaming
                 Giraph              GraphX
                 HUE                 Spark Notebook, ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as Data Center "OS" bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS ...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and Memory
Running tasks     Unix processes                              Linux Container groups
Requests          Specific requests and locality preference   More generic but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API bull Spark Native API in Scala, Java and Python bull Interactive shell in Scala and Python bull Spark supports Java 8 lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
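The flavor of the native API — a chain of flatMap / map / reduceByKey-style transformations — can be mimicked in ordinary Python without a cluster. This is a sketch of the programming model only, not of Spark's distributed execution:

```python
from collections import Counter
from itertools import chain

# Word count written in the shape of the Spark RDD API
# (flatMap -> map -> reduceByKey), but over plain Python lists.
lines = ["spark and hadoop", "spark or hadoop"]

words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = ((w, 1) for w in words)                               # map to (key, 1)
counts = Counter()                                            # reduceByKey(+)
for word, one in pairs:
    counts[word] += one

assert counts == {"spark": 2, "hadoop": 2, "and": 1, "or": 1}
```

In Scala or Java 8 the same pipeline is a handful of lambda expressions, which is the conciseness point the slide makes.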
92
Spark SQL bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
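The "mix and match SQL and imperative APIs" idea is easy to demonstrate outside Spark. The sketch below uses Python's built-in sqlite3 in place of Spark SQL, purely to show a declarative step and a procedural step interleaving in one program:

```python
import sqlite3

# Declarative step: SQL selects and aggregates ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 5), ("alice", 4)])
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... imperative step: arbitrary code post-processes the result set,
# the way a Spark program follows a SQL query with RDD transformations.
top_users = [user for user, total in rows if total > 4]

assert rows == [("alice", 7), ("bob", 5)]
assert top_users == ["alice", "bob"]
```

Spark SQL's contribution is doing this over distributed data with one engine, so the SQL results and the programmatic transformations share the same in-memory datasets.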
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                               Spark Streaming
Processing model             Record at a time                    Mini batches
Latency                      Sub-second                          Few seconds
Fault tolerance              At least once (may be duplicates)   Exactly once
Batch framework integration  Not available                       Core Spark API
Supported languages          Any programming language            Scala, Java, Python
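The first two rows of the table describe one design choice. A plain-Python sketch of the two processing models (no Storm or Spark involved; timestamps are invented for illustration):

```python
# Per-record vs micro-batch processing, the core contrast in the table:
# Storm handles each record as it arrives (low latency), while Spark
# Streaming groups records arriving in the same interval into a small
# batch and processes the batch at once (latency ~ the batch interval).
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

# Record-at-a-time: one processing call per event.
processed = [value for _, value in events]
assert processed == ["a", "b", "c", "d", "e"]

# Mini-batches with a 1-second interval: events are keyed by which
# interval their timestamp falls into, then handled a batch at a time.
batches = {}
for timestamp, value in events:
    batches.setdefault(int(timestamp), []).append(value)

assert batches == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```

Batching is also what buys Spark Streaming its exactly-once story: a whole batch can be recomputed deterministically from its inputs, instead of tracking acknowledgements per record.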
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expec (Expected in Mahout 10 )
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration: Service / Open Source Tool
Storage / Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) is planned, to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
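Because these backends are all reached through the Hadoop storage API, switching storage usually means changing only the URI scheme. A minimal Spark 1.x-era Scala sketch (hostnames and paths are placeholders, and `sc` is assumed to be an existing SparkContext):

```scala
// Same textFile API, different storage backends; only the URI scheme changes.
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
val fromLocal = sc.textFile("file:///tmp/events.log")
val fromS3    = sc.textFile("s3n://my-bucket/logs/events.log") // needs AWS credentials configured

println(fromHdfs.count()) // actions run identically regardless of the backend
```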
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
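The newAPIHadoopRDD route mentioned above can be sketched as follows, modeled on the HBaseTest.scala example (the table name is a placeholder; the HBase client jars and a running cluster are assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

// Expose the HBase table as an RDD of (row key, row) pairs
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRDD.count())
```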
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
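With the DataStax connector on the classpath, reading and writing Cassandra tables reduces to a couple of calls. A sketch (keyspace, table, and column names are placeholders):

```scala
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD (keyspace and table are placeholders)
val words = sc.cassandraTable("test_ks", "words")
println(words.count())

// Write an RDD back to Cassandra, mapping tuple fields to columns
sc.parallelize(Seq(("spark", 1), ("cassandra", 2)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```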
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
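Going through the Mongo-Hadoop connector follows the same newAPIHadoopRDD pattern as HBase. A sketch (the connection URI is a placeholder; the mongo-hadoop jars are assumed to be on the classpath):

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection") // placeholder URI

// Expose the collection as an RDD of (ObjectId, BSON document) pairs
val documents = sc.newAPIHadoopRDD(mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

println(documents.count())
```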
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
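The import/query/write-back flow above looks roughly like this with Spark 1.2's HiveContext, assuming a Hive metastore is reachable and the table names are placeholders:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Query an existing Hive table (name is a placeholder)
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)

// Results are ordinary SchemaRDDs, so they can feed MLlib or be written back
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")
```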
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
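The receiver-based API from the integration guide can be sketched like this (ZooKeeper quorum, consumer group, and topic map are placeholders; the spark-streaming-kafka artifact is assumed on the classpath):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10)) // 10-second mini-batches

// Map of (topic -> number of receiver threads); all values are placeholders
val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

messages.map(_._2).count().print() // count of message payloads per batch

ssc.start()
ssc.awaitTermination()
```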
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
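Schema inference in practice, as a Spark 1.2-era sketch (the path and field names are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point Spark SQL at JSON; the schema is inferred, no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json") // placeholder path
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```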
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
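The Parquet read/write round trip, in Spark 1.2-era API (paths are placeholders):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Any SchemaRDD can be written out as Parquet
val people = sqlContext.jsonFile("people.json")   // placeholder input
people.saveAsParquetFile("people.parquet")        // columnar output

// Read it back; the schema is preserved in the Parquet metadata
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect().foreach(println)
```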
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
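elasticsearch-hadoop's native RDD integration can be sketched like this (the index/type and document contents are placeholders; `es.nodes` must point at a reachable cluster):

```scala
import org.elasticsearch.spark._

// Write: any RDD whose elements translate to documents can be indexed
sc.makeRDD(Seq(Map("title" -> "Spark on ES", "year" -> 2015)))
  .saveToEs("talks/slides") // placeholder index/type

// Read: the index comes back as an RDD of (document id, document) pairs
val docs = sc.esRDD("talks/slides")
println(docs.count())
```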
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
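Because the cluster manager is selected by the master URL alone, the application code is identical across these deployments. A sketch (hosts and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick the clustering layer with the master URL; nothing else changes.
val master = "local[*]"                // local mode, all cores
// val master = "spark://host:7077"    // standalone cluster
// val master = "mesos://host:5050"    // Apache Mesos
// val master = "yarn-client"          // YARN, when Hadoop is present

val conf = new SparkConf().setMaster(master).setAppName("deployment-agnostic-app")
val sc = new SparkContext(conf)
```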
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem | Spark ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
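In Spark 1.x, the OFF_HEAP storage level stores RDD blocks in Tachyon rather than on the JVM heap. A sketch (assumes spark.tachyonStore.url points at a running Tachyon master; the path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs:///data/events.log") // placeholder path

// Blocks live in Tachyon, outside the executor JVMs, so cached data
// survives executor crashes and can be shared across applications
events.persist(StorageLevel.OFF_HEAP)
println(events.count())
```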
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
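The mini-batch model from the comparison above means each batch is just an RDD, processed with the core Spark API. A minimal sketch (host and port are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5)) // 5-second mini-batches

// Word count over a socket stream, written with ordinary RDD operations
ssc.socketTextStream("localhost", 9999) // placeholder source
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```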
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
40
(Expec (Expected in Mahout 10 )
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integrationbull Spark was designed to read and write data from and to HDFS as
well as other storage systems supported by Hadoop API such as your local file system Hive HBase Cassandra and Amazonrsquos S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memoryThis allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 30 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some open issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
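A minimal sketch of the Hive support described above, as it looks in the Spark 1.2 shell (the table name `src` is a hypothetical existing Hive table):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up hive-site.xml from the classpath and
// gives access to the Hive metastore, data formats, and UDFs
val hiveCtx = new HiveContext(sc) // sc: the shell's SparkContext

// Query an existing Hive table with plain SQL
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)

// Write results back out as a new Hive table
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")
```

The result of `sql(...)` is a SchemaRDD, so it can also be processed with the normal RDD API before being written out.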
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
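The native Kafka integration above can be sketched as follows (Spark 1.2-era receiver-based API; the ZooKeeper quorum, consumer group, and topic are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Build a streaming context with 2-second micro-batches
val ssc = new StreamingContext(sc, Seconds(2)) // sc: existing SparkContext

// createStream returns a DStream of (key, message) pairs;
// the Map gives the number of receiver threads per topic
val messages = KafkaUtils
  .createStream(ssc, "zkhost:2181", "demo-group", Map("events" -> 1))
  .map(_._2)

// Classic streaming word count over each micro-batch
messages.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```

This requires the spark-streaming-kafka artifact on the classpath; later Spark releases added a direct (receiver-less) approach as well.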
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
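A sketch of the push-based approach (Approach 1): Spark Streaming starts an Avro receiver that a Flume agent is configured to push events to. Hostname and port are placeholders, and `ssc` is an existing StreamingContext:

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Spark Streaming listens as an Avro sink; the Flume agent's
// avro sink must point at this host:port
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 9999)

// Each element is a SparkFlumeEvent wrapping the Avro event body
flumeStream
  .map(ev => new String(ev.event.getBody.array()))
  .print()
```

The pull-based approach instead deploys a custom Spark sink jar into the Flume agent and uses `FlumeUtils.createPollingStream`, trading setup complexity for stronger reliability.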
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
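The schema-inference flow above looks like this in the Spark 1.2 shell (the file `people.json`, with one JSON object per line, is a hypothetical input):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: the shell's SparkContext

// Infer the schema directly from the JSON records - no DDL needed
val people = sqlContext.jsonFile("people.json")
people.printSchema() // shows the inferred nested schema

// Register the SchemaRDD as a table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect()
  .foreach(println)
```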
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
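Round-tripping a SchemaRDD through Parquet, per the built-in support listed above (Spark 1.2 API; paths are hypothetical, and `people` is any SchemaRDD, e.g. one loaded from JSON):

```scala
// Write the SchemaRDD out in columnar Parquet format,
// preserving its schema
people.saveAsParquetFile("people.parquet")

// Read it back; the schema travels with the data
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people")
  .collect()
  .foreach(println)
```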
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
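A sketch of querying Avro data through the spark-avro library mentioned above. This assumes the early spark-avro API for Spark 1.2, where `avroFile` is added to SQLContext via an implicit import; the file name is hypothetical:

```scala
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)

// Load an Avro file as a SchemaRDD; the Avro schema is mapped
// to a Spark SQL schema automatically
val episodes = sqlContext.avroFile("episodes.avro")
episodes.registerTempTable("episodes")

sqlContext.sql("SELECT title FROM episodes LIMIT 5")
  .collect()
  .foreach(println)
```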
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
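The elasticsearch-hadoop native integration described above works roughly like this (the index/type name and node address are hypothetical; requires the elasticsearch-hadoop jar on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / esRDD

val conf = new SparkConf()
  .setAppName("EsSketch")
  .set("es.nodes", "localhost") // hypothetical ES node
val sc = new SparkContext(conf)

// Any RDD whose elements can be turned into documents can be indexed
val docs = Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop"))
sc.makeRDD(docs).saveToEs("library/books") // "index/type"

// Read the index back as an RDD of (documentId, fieldMap) pairs
val esRdd = sc.esRDD("library/books")
esRdd.take(2).foreach(println)
```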
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
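For example, reading directly from Amazon S3 is just a matter of the URI scheme; Spark delegates to the Hadoop S3 filesystem client, no HDFS involved (bucket and path below are hypothetical):

```scala
// Credentials go into the Hadoop configuration; here they are
// pulled from environment variables for illustration
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
  sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
  sys.env("AWS_SECRET_ACCESS_KEY"))

// Read log files straight out of S3, exactly like an HDFS path
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.count())
```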
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

              Hadoop ecosystem   Spark ecosystem
Components:   HDFS               Tachyon
              YARN               Mesos
Tools:        Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
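Because Tachyon speaks the Hadoop FileSystem API, using it from Spark is again just a URI change (master host/port below are hypothetical defaults):

```scala
// Read and write through Tachyon instead of hdfs://
val rdd = sc.textFile("tachyon://tachyon-master:19998/data/input.txt")
rdd.saveAsTextFile("tachyon://tachyon-master:19998/data/output")

// In Spark 1.x, RDDs can also be persisted off-heap,
// which is backed by Tachyon
import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.OFF_HEAP)
```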
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
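The conciseness of the native API is easiest to see in the canonical word count, here as typed into the Scala shell (input path hypothetical):

```scala
val lines = sc.textFile("input.txt")

// flatMap/map/reduceByKey chain: each step is a lambda
val counts = lines
  .flatMap(_.split("\\s+")) // split lines into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum counts per word

counts.take(10).foreach(println)
```

The equivalent Java 8 version with lambda expressions is nearly line-for-line the same, which is the point the slide makes.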
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance              At least once               Exactly once
(every record processed)     (may be duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
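From the application's point of view, these deployment modes differ mainly in the --master URL handed to spark-submit. A few illustrative invocations; host names, ports, and app.py are placeholders, not real endpoints:

```shell
# 1. Local mode: n worker threads on one machine, no cluster at all.
spark-submit --master "local[4]" app.py

# 2. Spark standalone cluster.
spark-submit --master spark://master-host:7077 app.py

# 3. Apache Mesos.
spark-submit --master mesos://mesos-master:5050 app.py

# 4. Hadoop YARN, when a Hadoop cluster is available
#    (Spark 1.x syntax; later versions use --master yarn --deploy-mode cluster).
spark-submit --master yarn-cluster app.py
```

The application code itself does not change between these modes.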
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem | Spark ecosystem
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services from the Hadoop ecosystem integrated with Spark, each paired with an open source tool:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at One size doesn't fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System) Your 'Big Data' use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
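Spark addresses all of these storage systems through Hadoop-style URIs, so switching file systems is mostly a matter of the path's scheme. A minimal plain-Python sketch of that dispatch idea (the scheme-to-backend table below is illustrative; Spark itself resolves schemes through the Hadoop FileSystem API, not a dict like this):

```python
from urllib.parse import urlparse

# Illustrative mapping of URI schemes to storage backends
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "maprfs": "MapR-FS",
    "tachyon": "Tachyon",
    "swift": "OpenStack Swift",
    "file": "Local file system",
}

def backend_for(path):
    """Pick a storage backend from the URI scheme; schemeless
    paths fall back to the local file system."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

backend_for("s3n://bucket/logs")              # 'Amazon S3'
backend_for("tachyon://master:19998/data")    # 'Tachyon'
backend_for("/tmp/data")                      # 'Local file system'
```

This is why "bring your own storage" works: application code passes a different URI and the engine does the rest.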
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazon's S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
bull DSE (DataStax Enterprise) built on Apache Cassandra presents itself as a Non-Hadoop Big Data Platform Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark & Cassandra Piotr Kolaczkowski September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector Helena Edelson November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100% open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull The Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component
HDFS | Tachyon
YARN | Mesos
Tools
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
bull Mesos as Data Center "OS"
bull Share the datacenter between multiple cluster computing apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8, whose much more concise lambda expressions make code nearly as simple as with the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
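As a flavor of the RDD-style API, here is the canonical word count expressed with plain Python lambdas and functional building blocks (no SparkContext involved; in real Spark the same shape appears as textFile → flatMap → map → reduceByKey):

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = map(lambda w: (w, 1), words)

# reduceByKey: sum the counts per word
def reduce_by_key(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

counts = reduce(reduce_by_key, pairs, {})
# counts == {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The concise lambda style is exactly what Java 8 brought to the Java API, closing much of the gap with Scala.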
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDFs) and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
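The processing-model row is the key difference; a plain-Python sketch of it (not Storm or Spark code): record-at-a-time processing is just mini-batching with a batch size of one:

```python
def to_mini_batches(records, batch_size):
    """Group an incoming stream into fixed-size mini batches, the way
    Spark Streaming groups records per batch interval (simplified: we
    batch by count here, not by wall-clock time)."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

events = ["e1", "e2", "e3", "e4", "e5"]
spark_style = to_mini_batches(events, 2)  # [['e1','e2'], ['e3','e4'], ['e5']]
storm_style = to_mini_batches(events, 1)  # one record per "batch"
```

Batching amortizes scheduling overhead (hence Spark Streaming's throughput and exactly-once semantics) at the cost of the few seconds of latency shown in the table.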
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
bull Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is file-system agnostic Bring your own storage
2 Deployment Spark is cluster-infrastructure agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in Non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
43
3 Integration
Service | Open Source Tool
bull Storage/Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cache httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application The related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
bull Out of the box Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from the Spark code base httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
bull Spark Cassandra Connector This library lets you expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
bull A Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly supported by Spark although it can be used from Spark via the official Mongo-Hadoop connector
bull MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
bull There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark
bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example PART 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
bull YARN (Yet Another Resource Negotiator) an implicit reference to Mesos as the resource negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark integration is work in progress in 2015 to address new use cases
bull Use a Drill query (or view) as the input to Spark Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
bull Spark SQL provides built-in support for JSON which vastly simplifies the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL just point Spark SQL at JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
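Schema inference of the kind described above can be illustrated in plain Python with the standard json module (a toy version of what Spark SQL does when scanning records and unioning their fields; Spark infers richer SQL types and merges type conflicts):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: scan newline-delimited JSON records and
    record the Python type name seen for each field. Keeps the
    first-seen type; real Spark SQL reconciles conflicting types."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "city": "LA"}']
# infer_schema(records) == {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the inferred schema is the union of fields across records, which is why no up-front DDL is needed.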
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming language httpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull An illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
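Why a columnar format helps: a query touching one field only has to read that column. A minimal plain-Python contrast of row vs columnar layout (illustrative only; Parquet adds encodings, compression and row groups on top of this idea):

```python
# Row layout: each record stored together (like a CSV or Avro file)
rows = [
    {"name": "alice", "age": 34, "city": "LA"},
    {"name": "bob",   "age": 29, "city": "SF"},
]

# Columnar layout: each field stored contiguously (like Parquet)
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A scan of just 'age' touches one list instead of every full record:
ages = columns["age"]            # [34, 29]
avg_age = sum(ages) / len(ages)  # 31.5
```

This column pruning is what makes analytical SQL over Parquet fast regardless of how wide the records are.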
58
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull An example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
bull Spark support has been added in the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark also provides an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving of searchable complex data "CrunchIndexerTool on Spark"
bull A Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
             Hadoop Ecosystem   Spark Ecosystem
Component    HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). httpsamplabcsberkeleyedusoftware
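The memory-centric idea can be sketched in a few lines of plain Python: a RAM-backed store that several "frameworks" share without touching disk. This is a conceptual toy, not Tachyon's actual API.

```python
# Toy in-memory "file system": files live in RAM and are shared across
# computing frameworks. Conceptual sketch only, not Tachyon's real API.

class MemoryFS:
    def __init__(self):
        self._files = {}          # path -> bytes, held in RAM

    def write(self, path, data: bytes):
        self._files[path] = data  # no disk I/O involved

    def read(self, path) -> bytes:
        return self._files[path]  # memory-speed access for any client

fs = MemoryFS()                   # shared by a "Spark" job and a "MapReduce" job
fs.write("/warehouse/events", b"e1,e2,e3")
assert fs.read("/warehouse/events") == b"e1,e2,e3"
```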
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos
Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and memory
Running tasks      Unix processes                              Linux container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
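As a rough illustration of the concise, chained style the native API encourages, here is a word count written with plain Python functional tools; a pyspark equivalent would follow the same flatMap → map → reduceByKey shape. This is ordinary Python standing in for the Spark API, not Spark itself.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# flatMap -> map -> reduceByKey, expressed with plain Python:
words = chain.from_iterable(line.split() for line in lines)   # flatMap
counts = Counter(words)                                        # map + reduceByKey

assert counts["to"] == 4 and counts["be"] == 2
```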
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
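The "mix and match SQL with imperative code" idea can be sketched with the stdlib sqlite3 module standing in for Spark SQL; the table name and data below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user").fetchall()

# ... then an imperative step over the result, much as one would keep
# chaining RDD/DataFrame operations in Spark after a SQL query.
top = [user for user, total in rows if total > 6]
assert top == ["ann"]
```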
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance – every record processed   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
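The "record at a time" vs. "mini batches" distinction above can be sketched in plain Python; this is a conceptual simulation, not either framework's API.

```python
import itertools

# Record-at-a-time (Storm-style): handle each event as it arrives.
def per_record(source, handle):
    for record in source:
        handle(record)

# Mini-batches (Spark Streaming-style): cut the stream into small
# batches and run a batch computation on each one.
def mini_batches(source, batch_size):
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

seen = []
per_record(iter(range(3)), seen.append)            # events handled one by one
batches = list(mini_batches(iter(range(10)), 4))   # events handled batch by batch

assert seen == [0, 1, 2]
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```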
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython. httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cached data. httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector, httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly supported by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
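Schema inference over JSON records can be sketched in plain Python. Spark SQL does a far more complete version of this (nested types, type widening), but the core idea is just scanning records and unioning the observed field types; the records below are made up.

```python
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 2, "city": "LA"}',
]

# Infer a flat schema: field name -> set of observed JSON value types.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

assert schema == {"name": {"str"}, "age": {"int"}, "city": {"str"}}
```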
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files. httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
58
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets.
• Data layout can change without notice.
• New data sets can be added without notice.
• Result:
• Leverage Spark to dynamically split the data.
• Leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group.
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
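Spark's storage agnosticism surfaces in the path prefix: the URI scheme (hdfs://, s3n://, file://, tachyon://, swift://) selects the storage backend. A tiny pure-Python sketch of that dispatch idea; the scheme names are real, but the mapping code is illustrative, not Spark's actual resolution mechanism.

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to a storage backend name;
# Spark resolves schemes like these through the Hadoop FileSystem API.
BACKENDS = {"hdfs": "HDFS", "s3n": "Amazon S3", "file": "local FS",
            "tachyon": "Tachyon", "swift": "OpenStack Swift"}

def backend_for(path: str) -> str:
    scheme = urlparse(path).scheme or "file"   # bare paths default to local
    return BACKENDS[scheme]

assert backend_for("s3n://bucket/logs") == "Amazon S3"
assert backend_for("hdfs://nn:8020/data") == "HDFS"
assert backend_for("/tmp/local.txt") == "local FS"
```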
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic: Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic: choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark YARN issues in JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for the machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, and embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
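A stdlib-only sketch of what "automatically infer the schema" means in practice: scan the JSON records, union the field names, and note each field's type. Spark SQL's actual inference is much richer (type merging across conflicting records, nested structures, arrays), so treat this as a conceptual illustration only.

```python
import json

# Minimal schema inference over a JSON-lines dataset: union the field
# names across records and record the Python type of the first value
# seen for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

data = [
    '{"name": "spark", "stars": 4000}',
    '{"name": "hadoop", "stars": 6000, "tag": "batch"}',
]
schema = infer_schema(data)
print(schema)  # {'name': 'str', 'stars': 'int', 'tag': 'str'}
```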
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
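Why a columnar format matters for the SQL workloads above can be shown with a toy pure-Python layout comparison: a column store keeps each column contiguous, so a query touching one column scans only that column, and values of a single type sit together and compress well. Parquet's real on-disk format adds row groups, encodings and compression on top of this basic idea.

```python
rows = [
    {"id": 1, "lang": "scala",  "stars": 10},
    {"id": 2, "lang": "python", "stars": 20},
    {"id": 3, "lang": "java",   "stars": 30},
]

# Row layout: records stored one after another (like a CSV or Avro file).
row_store = [tuple(r.values()) for r in rows]

# Column layout: one contiguous list per column (the Parquet idea).
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# A query like "SELECT sum(stars)" only needs to scan a single column.
print(sum(column_store["stars"]))  # 60
```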
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines can:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than having to choose one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
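The Data << RAM point is mostly about avoiding repeated parsing and re-reading of the input. Here is a stdlib-only sketch of the effect of caching a parsed dataset, conceptually what rdd.cache() buys you in Spark; the call counter just makes the saved work visible.

```python
import json

RAW = ['{"v": %d}' % i for i in range(5)]
parse_calls = 0

def parse_all():
    # Stands in for the expensive parse/deserialize step that would
    # otherwise be redone on every pass over the data.
    global parse_calls
    parse_calls += 1
    return [json.loads(line)["v"] for line in RAW]

# Without caching: every query re-parses the input, the way repeated
# batch passes re-read from disk.
total = sum(parse_all())
maximum = max(parse_all())
assert parse_calls == 2

# With caching: parse once, then reuse the in-memory parsed form for
# every subsequent query. This wins as long as the data fits in RAM.
parse_calls = 0
cached = parse_all()          # materialized once
total, maximum = sum(cached), max(cached)
print(parse_calls)  # 1
```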
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform: Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
          | Hadoop ecosystem | Spark ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a data center between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity         | Less mature                                | Relatively more mature
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
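The schema-inference point above can be sketched in a few lines with the Spark 1.2 `SQLContext.jsonFile` API (the file name and field names are hypothetical; an existing `SparkContext` named `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

// people.json holds one JSON object per line, e.g. {"name":"Ann","age":31}
val sqlCtx = new SQLContext(sc)

val people = sqlCtx.jsonFile("people.json") // schema inferred automatically - no DDL
people.printSchema()                        // shows the inferred fields and types

people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```

In Spark 1.3+ the same idea is expressed through the DataFrame reader API, as noted in the slide.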
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
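The three built-in capabilities listed above map directly onto the Spark 1.2 SchemaRDD API; a brief sketch, assuming an existing `SQLContext` named `sqlCtx` and a SchemaRDD `people` (e.g. loaded from JSON as in the programming guide):

```scala
// Write an RDD with a schema out as a Parquet file (schema travels with the data)
people.saveAsParquetFile("people.parquet")

// Import it back; the schema is preserved, no DDL needed
val parquetPeople = sqlCtx.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")

// Run SQL queries over the imported data
sqlCtx.sql("SELECT name FROM parquet_people").collect().foreach(println)
```

The file path and table names are placeholders; the method names match the Spark 1.2 SQL programming guide linked above.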
58
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
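A sketch of the spark-avro library mentioned above, assuming its `avroFile` helper from the Spark 1.2-era releases and a hypothetical `episodes.avro` file:

```scala
// Requires the spark-avro package from https://github.com/databricks/spark-avro
// on the classpath, plus an existing SQLContext named sqlCtx
import com.databricks.spark.avro._

// Load Avro records as a SchemaRDD; the Avro schema drives the SQL schema
val episodes = sqlCtx.avroFile("episodes.avro")
episodes.registerTempTable("episodes")
sqlCtx.sql("SELECT title FROM episodes").collect().foreach(println)
```

Because the Avro schema is stored with the data, this copes with the "layout can change without notice" problem described in the use case: new fields appear in the inferred schema without code changes.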
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
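The RDD read/write integration described above is a one-liner each way with elasticsearch-hadoop's Spark support (the `blog/articles` index/type and document fields are placeholders):

```scala
// Requires the elasticsearch-hadoop (elasticsearch-spark) jar on the classpath
// and an existing SparkContext named sc, configured with es.nodes etc.
import org.elasticsearch.spark._

// Read: each hit becomes a (documentId, fieldMap) pair; the second
// argument is an optional Elasticsearch query
val articles = sc.esRDD("blog/articles", "?q=spark")

// Write: any RDD whose elements translate into documents can be saved
val docs = sc.makeRDD(Seq(Map("title" -> "Spark meets ES", "views" -> 10)))
docs.saveToEs("blog/articles")
```

A running Elasticsearch cluster is assumed; this is a sketch of the API shape, not a tuned pipeline.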
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented" and has a more mature shuffling implementation and closer YARN integration
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
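As a concrete illustration of option 4, reading directly from Amazon S3 looks just like reading from HDFS; a sketch, with the bucket name and key prefix as placeholders and credentials taken from the environment:

```scala
// Assumes an existing SparkContext named sc; the s3n:// connector ships
// with the Hadoop client libraries that Spark bundles
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// No HDFS anywhere: read logs straight out of an S3 bucket
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.count())
```

Credentials can alternatively be set in `core-site.xml` or the instance role; this snippet only shows the API shape.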
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters (with the data and analytical tools that your data scientists need) in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component | Hadoop ecosystem | Spark ecosystem
Storage   | HDFS             | Tachyon
Resources | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/
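The "without any code change" point follows from Tachyon exposing a Hadoop-compatible file system; a sketch (the master host and paths are placeholders, and the Tachyon client must be on the classpath):

```scala
// Spark reads and writes tachyon:// paths exactly like hdfs:// paths;
// 19998 is Tachyon's default master port
val events = sc.textFile("tachyon://master:19998/events/input")

events.filter(_.contains("ERROR"))
      .saveAsTextFile("tachyon://master:19998/events/errors")
```

Only the URI scheme changes relative to an HDFS-based job, which is exactly the compatibility claim made on this slide.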
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
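The conciseness claim above is easiest to see in the classic word count, shown here in the Scala API (input and output paths are placeholders; the Java 8 lambda version is nearly as short):

```scala
// Assumes an existing SparkContext named sc
val counts = sc.textFile("hdfs:///input/books")
  .flatMap(line => line.split(" ")) // one element per word
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum counts per word across the cluster

counts.saveAsTextFile("hdfs:///output/wordcounts")
```

The equivalent MapReduce job typically requires separate mapper and reducer classes plus driver boilerplate, which is the contrast this slide is drawing.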
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
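A sketch of the Cassandra-as-storage-backend idea above, using the DataStax spark-cassandra-connector (the connection host, keyspace and table names are hypothetical):

```scala
// Requires the spark-cassandra-connector jar on the classpath
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows
val kv = sc.cassandraTable("test", "kv")
println(kv.count())

// Save an RDD of tuples back to Cassandra, mapping tuple fields to columns
sc.parallelize(Seq((3, "three"), (4, "four")))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))
```

A running Cassandra cluster with the `test.kv` table is assumed; the connector's own documentation covers write consistency and tuning options.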
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3. Integration
• There is also NSMC, a native Spark MongoDB connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: Introduction & Setup https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: Hive Example http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: Spark Example & Key Takeaways http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-starthtml
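Reading a collection through the Mongo-Hadoop connector mentioned above follows the standard Hadoop InputFormat pattern; a sketch, with the MongoDB URI, database and collection names as placeholders:

```scala
// Requires the mongo-hadoop core jar and the MongoDB Java driver on the classpath,
// plus an existing SparkContext named sc
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

val documents = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat], // splits the collection into Hadoop input splits
  classOf[Object],           // key: the document _id
  classOf[BSONObject])       // value: the BSON document itself

println(documents.count())
```

Writing works symmetrically through `MongoOutputFormat` with a `mongo.output.uri` setting; see the connector repository linked above for the supported options.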
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity - References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and memory
Running tasks     Unix processes                             Linux container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                       Storm                              Spark Streaming
Processing model               Record at a time                   Mini batches
Latency                        Sub-second                         Few seconds
Fault tolerance                At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration    Not available                      Core Spark API
Supported languages            Any programming language           Scala, Java, Python
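The processing-model row is the heart of this comparison. A plain-Python sketch of the difference (no Storm or Spark needed; the hypothetical `op` stands in for the per-record logic): a record-at-a-time engine applies the operation as each event arrives, while a micro-batch engine buffers a short window and processes it as one small batch.

```python
from itertools import islice

def record_at_a_time(stream, op):
    """Storm-style: op fires once per record, as each record arrives."""
    return [op(rec) for rec in stream]

def micro_batched(stream, op, batch_size):
    """Spark Streaming-style: the stream is cut into small batches;
    latency is therefore bounded below by the batch interval."""
    it = iter(stream)
    batches = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return batches
        batches.append([op(rec) for rec in batch])

events = [1, 2, 3, 4, 5, 6, 7]
per_record = record_at_a_time(events, lambda x: x * 10)
per_batch = micro_batched(events, lambda x: x * 10, batch_size=3)
```

Both produce the same results; the difference is the granularity at which results become available.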
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
  • GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in elasticsearch-hadoop was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when the data we process is smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration: HUE
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
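Whichever option you pick, the choice surfaces as the master URL handed to spark-submit. A sketch of the common forms in Spark 1.x (host names and ports below are placeholders, not real endpoints):

```shell
# Placeholders only: replace hosts/ports with your own cluster's.
spark-submit --master local[4]           app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077  app.py   # standalone cluster manager
spark-submit --master mesos://host:5050  app.py   # Apache Mesos
spark-submit --master yarn-cluster       app.py   # Hadoop YARN (Spark 1.x syntax)
```

The application code stays the same across all of these; only the master URL (and cluster-specific configuration) changes, which is what makes Spark deployment-agnostic.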
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
                 Hadoop ecosystem    Spark ecosystem
Components:
                 HDFS                Tachyon
                 YARN                Mesos
Tools:
                 Pig                 Spark native API
                 Hive                Spark SQL
                 Mahout              MLlib
                 Storm               Spark Streaming
                 Giraph              GraphX
                 HUE                 Spark Notebook / ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria            YARN                           Mesos
Resource sharing    Yes                            Yes
Written in          Java                           C++
Scheduling          Memory only                    CPU and memory
Running tasks       Unix processes                 Linux container groups
Requests            Specific requests and          More generic, but more coding
                    locality preference            for writing frameworks
Maturity            Less mature                    Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
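The lambda-centric style the API encourages can be previewed with plain Python built-ins. This is a pure-Python stand-in for the classic RDD word-count chain, not PySpark itself; the real API spells these steps flatMap, map, and reduceByKey, and runs them over distributed partitions:

```python
from functools import reduce

# Word count in the functional style Spark's native API uses.
lines = ["spark or hadoop", "spark and hadoop", "hadoop"]

words = [w for line in lines for w in line.split()]            # flatMap
pairs = map(lambda w: (w, 1), words)                           # map
counts = reduce(lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
                pairs, {})                                     # reduceByKey
print(counts["spark"], counts["hadoop"])  # 2 3
```

The same chain reads almost identically in Scala, and with Java 8 lambdas it finally stops requiring pages of anonymous inner classes, which is the point the slide makes.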
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                    Storm                       Spark Streaming
Processing model            Record at a time            Mini batches
Latency                     Sub-second                  Few seconds
Fault tolerance (every      At least once (may be       Exactly once
record processed)           duplicates)
Batch framework             Not available               Core Spark API
integration
Supported languages         Any programming language    Scala, Java, Python
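The "mini batches" row is the key architectural difference: Spark Streaming groups incoming records into small batches and runs an ordinary batch job on each one. A toy simulation of that idea (real Spark Streaming batches by a time interval; this sketch batches by count to stay deterministic):

```python
def micro_batches(stream, batch_size):
    """Group an incoming record stream into mini batches, the way
    Spark Streaming discretizes a stream into small RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = range(1, 8)
batches = list(micro_batches(events, 3))
print(batches)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Each batch is then processed with the normal batch Spark API, which is why Spark Streaming gets core-Spark integration and exactly-once semantics essentially for free, at the cost of the few seconds of latency shown in the table.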
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
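What makes Kafka a good match for Spark Streaming is its replayable, offset-addressed log: after a failure, a consumer can re-read from its last committed offset. A toy pure-Python sketch of that abstraction (not Kafka's actual API, just the core idea):

```python
class Log:
    """Toy append-only log with offsets, the core Kafka abstraction
    that lets a stream consumer replay records after a failure."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset assigned to the record

    def read_from(self, offset):
        return self.records[offset:]       # replay from a committed offset

log = Log()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

committed = 2  # suppose the consumer crashed after committing offset 2
# On restart, replay everything at and after the committed offset:
print(log.read_from(committed))  # ['c', 'd']
```

Because records stay in the log and are addressed by offset, a restarted Spark Streaming job can reprocess exactly the records it had not yet finished, which underpins the recovery story in the Netflix "Chaos Monkey" example cited earlier.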
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters and deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
             Hadoop Ecosystem    Spark Ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and memory
Running tasks     Unix processes                    Linux container groups
Requests          Specific requests and             More generic, but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data and ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini-batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one over the other.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and offers closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
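The read/write round trip above can be sketched with the Spark 1.2-era Python API (`saveAsParquetFile` / `parquetFile`; the sample record and output path are placeholders):

```python
SAMPLE = '{"name": "Alice", "age": 34}'

def run(path="/tmp/people.parquet"):
    # Requires Spark 1.2+; the path is a placeholder -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ParquetDemo")
    sqlContext = SQLContext(sc)
    people = sqlContext.jsonRDD(sc.parallelize([SAMPLE]))
    people.saveAsParquetFile(path)        # write the RDD out as Parquet
    back = sqlContext.parquetFile(path)   # import it back in
    back.registerTempTable("people")
    return sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()
```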
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
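For reference, loading Avro through the spark-avro library looks roughly like this in the Spark 1.3-era Python API. Everything here is a placeholder sketch: the file name, table, and column are invented, and the `--packages` coordinates are only an example of how such a library is typically attached.

```python
def run(path="episodes.avro"):
    # Requires Spark 1.2+ with the spark-avro package available, e.g.
    #   spark-submit --packages com.databricks:spark-avro_2.10:1.0.0 ...
    # File name and column names below are placeholders -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="AvroDemo")
    sqlContext = SQLContext(sc)
    episodes = sqlContext.load(path, source="com.databricks.spark.avro")
    episodes.registerTempTable("episodes")
    return sqlContext.sql("SELECT title FROM episodes").collect()
```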
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem / Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when the data processed is smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos (criteria: YARN / Mesos)
• Resource sharing: yes / yes
• Written in: Java / C++
• Scheduling: memory only / CPU and memory
• Running tasks: Unix processes / Linux container groups
• Requests: specific requests and locality preference / more generic, but more coding for writing frameworks
• Maturity: less mature / relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
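The conciseness of the native API is easiest to see in the classic word count. A sketch in the Python API (the input path is a placeholder), with the same aggregation written as a plain function for comparison:

```python
def top_words(lines, n=3):
    # Plain-Python version of the flatMap/map/reduceByKey pipeline below.
    from collections import Counter
    return Counter(w for line in lines for w in line.split()).most_common(n)

def run():
    # Requires a Spark 1.x install; the input path is a placeholder -- sketch only.
    from pyspark import SparkContext

    sc = SparkContext(appName="NativeApiDemo")
    counts = (sc.textFile("hdfs:///logs/*")
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.takeOrdered(3, key=lambda kv: -kv[1]))
```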
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
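The "mix and match SQL and imperative APIs" point can be sketched with the Spark 1.x Python API: a Hive query feeds ordinary RDD operations. The table and column names are placeholders, and the per-department average is also written as a plain function so the logic is checkable without a cluster.

```python
def dept_averages(rows):
    # Plain-Python mirror of the RDD aggregation below; rows are (dept, salary).
    totals = {}
    for dept, salary in rows:
        s, n = totals.get(dept, (0.0, 0))
        totals[dept] = (s + salary, n + 1)
    return {d: s / n for d, (s, n) in totals.items()}

def run():
    # Requires Spark 1.2+ built with Hive support; table and column names are
    # placeholders -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="SqlMixDemo")
    hive = HiveContext(sc)  # reuses the existing Hive metastore, formats, UDFs
    rows = hive.sql("SELECT dept, salary FROM employees")   # declarative SQL...
    pairs = rows.map(lambda r: (r.dept, (r.salary, 1)))     # ...then imperative RDD ops
    sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return sums.mapValues(lambda s: s[0] / float(s[1])).collect()
```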
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming (criteria: Storm / Spark Streaming)
• Processing model: record at a time / mini-batches
• Latency: sub-second / few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) / exactly once
• Batch framework integration: not available / Core Spark API
• Supported languages: any programming language / Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015. httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014. httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014. httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr. httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop Ecosystem -> Spark Ecosystem
Components:
HDFS -> Tachyon
YARN -> Mesos
Tools:
Pig -> Spark native API
Hive -> Spark SQL
Mahout -> MLlib
Storm -> Spark Streaming
Giraph -> GraphX
HUE -> Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria: YARN | Mesos
• Resource sharing: yes | yes
• Written in: Java | C++
• Scheduling: memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: specific requests and locality preference | more generic, but more coding to write frameworks
• Maturity: less mature | relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014. httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
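To show the functional shape of the native API, here is the classic word count sketched in plain Python using only the standard library. This is not PySpark: in real Spark code the three steps below would be the RDD transformations flatMap, map, and reduceByKey running across a cluster.

```python
from collections import Counter
from functools import reduce

# Toy input standing in for an RDD of text lines.
lines = ["spark or hadoop", "spark with hadoop"]

flat_mapped = [w for line in lines for w in line.split()]       # flatMap: line -> words
mapped = [(w, 1) for w in flat_mapped]                          # map: word -> (word, 1)
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),  # reduceByKey: sum counts
                mapped, Counter())

print(dict(counts))
```

The same pipeline in Scala or Python on Spark reads almost identically, which is the point of the slide: the API is a thin functional layer over distributed collections.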
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: record at a time | mini-batches
• Latency: sub-second | few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) | exactly once
• Batch framework integration: not available | Core Spark API
• Supported languages: any programming language | Scala, Java, Python
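The processing-model row is the key difference. A toy plain-Python sketch of record-at-a-time versus mini-batch handling (not Storm or Spark code; real Spark Streaming batches by a time interval, not by record count):

```python
def record_at_a_time(stream, handle):
    # Storm-style: each record is handled the moment it arrives (low latency).
    for record in stream:
        handle([record])

def mini_batches(stream, handle, batch_size=3):
    # Spark Streaming-style: records are buffered and handled as small batches
    # (higher latency, but each batch can reuse the regular batch machinery).
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:           # flush the final partial batch
        handle(batch)

calls = []
mini_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because each mini-batch is just a small batch job, Spark Streaming inherits the Core Spark API "for free", which is what the batch-framework-integration row refers to.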
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark shell backend for IPython. httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015. httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
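As a rough illustration of what schema inference means, a few lines of plain Python can derive a field-to-type mapping by scanning JSON records. This is only a sketch of the idea, not Spark SQL's implementation (which also merges conflicting types, handles nesting, and samples efficiently):

```python
import json

def infer_schema(json_lines):
    # Scan every record and note the Python type of the first value
    # seen for each field; missing fields simply come from other records.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

The payoff is the same as in Spark SQL: no DDL step, because the structure is discovered from the data itself.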
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of integrating Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
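The idea behind a columnar format can be sketched in plain Python: store each field's values contiguously, so a query that touches one column can skip the rest. The real Parquet format of course adds encodings, compression, and metadata on top of this transposition:

```python
# Row-oriented layout, as records arrive.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
]

# Columnar layout: one contiguous list per field. A query that only needs
# "clicks" never has to read the "user" values at all.
columns = {field: [row[field] for row in rows] for field in rows[0]}
print(columns)  # {'user': ['a', 'b'], 'clicks': [3, 5]}
```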
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1. httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark. httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data web applications for interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014. httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015. httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster. httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
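A toy plain-Python sketch of why caching parsed data pays off when the working set fits in memory. Spark's cache() is of course distributed and lazy; this only shows the parse-once-reuse-many-times idea behind the "Data << RAM" case:

```python
parse_calls = 0  # instrumentation: count how often we pay the parsing cost

def parse(raw):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching: every pass over the data re-parses the raw input.
without_cache = [sum(parse(raw_data)) for _ in range(3)]  # 3 parses

# With caching: parse once, then every subsequent pass reads memory.
cached = parse(raw_data)                                  # 1 parse
with_cache = [sum(cached) for _ in range(3)]

print(parse_calls)  # 4
```

When the parsed data is larger than memory, this caching advantage disappears, which is the "Data >> RAM" side of the comparison above.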
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014. httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015. httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
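Conceptually, Spark selects a storage connector from the path's URI scheme, which is why swapping HDFS for another store is largely a path change in your job. A plain-Python sketch of that dispatch idea (the scheme names follow the examples above; this is not Spark's actual resolver):

```python
from urllib.parse import urlparse

# Example input paths a Spark job might receive; only the scheme differs.
paths = [
    "hdfs://namenode:8020/logs/2015/",   # Hadoop HDFS
    "s3n://my-bucket/logs/2015/",        # Amazon S3 (1.x-era s3n scheme)
    "tachyon://master:19998/logs/2015/", # Tachyon in-memory FS
    "file:///tmp/logs/",                 # local file system
]

# The scheme is all that decides which storage backend handles the path.
schemes = [urlparse(p).scheme for p in paths]
print(schemes)  # ['hdfs', 's3n', 'tachyon', 'file']
```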
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012. httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
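In practice, the deployment choice surfaces mostly as the --master URL passed to spark-submit (or spark-shell). A hedged sketch, with host names, ports, and the application file as placeholders (the yarn-client syntax shown is the Spark 1.x form current when this deck was written):

```shell
# Same application, different cluster managers -- only --master changes.
spark-submit --master "local[4]"          my_app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077   my_app.py   # standalone Spark cluster
spark-submit --master mesos://host:5050   my_app.py   # Apache Mesos
spark-submit --master yarn-client         my_app.py   # Hadoop YARN (Spark 1.x syntax)
```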
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

              Hadoop Ecosystem   Spark Ecosystem
  Component:  HDFS               Tachyon
              YARN               Mesos
  Tools:      Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

  Criteria           YARN                            Mesos
  Resource sharing   Yes                             Yes
  Written in         Java                            C++
  Scheduling         Memory only                     CPU and Memory
  Running tasks      Unix processes                  Linux Container groups
  Requests           Specific requests and           More generic, but more coding
                     locality preference             for writing frameworks
  Maturity           Less mature                     Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

  Criteria                      Storm                       Spark Streaming
  Processing model              Record at a time            Mini batches
  Latency                       Sub-second                  Few seconds
  Fault tolerance (every        At least once               Exactly once
  record processed)             (may be duplicates)
  Batch framework integration   Not available               Core Spark API
  Supported languages           Any programming language    Scala, Java, Python
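The "mini batches" row is the key difference: where Storm hands each record to the topology as it arrives, Spark Streaming chops the stream into small batches and runs a regular Spark job on each one. A framework-free sketch of that idea (plain Python, no Spark, purely illustrative; Spark Streaming batches by time interval, so batch size stands in for time here):

```python
def micro_batches(records, batch_size):
    """Group an incoming stream into fixed-size mini batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch       # a "batch job" would run here, once per batch
            batch = []
    if batch:
        yield batch           # flush the final partial batch

# Record-at-a-time (Storm-style) would instead call process(record) five times.
stream = ["e1", "e2", "e3", "e4", "e5"]
print(list(micro_batches(stream, 2)))  # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```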
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
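The "translated into documents" requirement just means each record must map to a JSON document. As a connector-free illustration (plain Python, stdlib only; the index and type names are made up), this is the shape of the _bulk payload that tools ingesting into Elasticsearch ultimately produce, an action line followed by a document line per record:

```python
import json

def to_bulk_body(records, index="logs", doc_type="event"):
    """Render records as a newline-delimited Elasticsearch _bulk request body."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"   # the Bulk API requires a trailing newline

records = [{"msg": "spark job started"}, {"msg": "spark job finished"}]
body = to_bulk_body(records)
print(body)
```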
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance              At least once               Exactly once
(every record processed)     (may be duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
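The "record at a time" vs "mini batches" distinction in the table can be illustrated in plain Python, with batch size standing in for Spark Streaming's batch time interval (a toy model, no Spark involved):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into small batches, the way Spark
    Streaming discretizes a stream into a sequence of RDDs. In this toy
    model, a count of records stands in for the batch time interval."""
    batch = []
    for record in stream:
        # Storm-style processing would handle `record` right here,
        # the moment it arrives (lowest latency, one record at a time).
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # Spark Streaming-style: emit a whole mini batch.
            batch = []
    if batch:
        yield batch  # flush the final partial batch

stream = iter(range(7))
batches = list(micro_batches(stream, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each emitted batch is then processed with the normal batch API, which is why Spark Streaming integrates with core Spark for free but pays a few seconds of latency per batch.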
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
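In practice, "bring your own storage" works because Spark resolves the storage backend from the URI scheme of the path it is given. A minimal sketch of that idea (the scheme list is illustrative, and only the parsing is shown, not Spark's actual resolution code):

```python
from urllib.parse import urlparse

# Spark picks the storage backend from the path's URI scheme, which is
# what makes it file system agnostic. Hosts and paths here are made up.
paths = [
    "hdfs://namenode:8020/logs/2015/03/12",
    "s3n://my-bucket/logs/2015/03/12",
    "tachyon://master:19998/logs/2015/03/12",
    "file:///tmp/logs/2015/03/12",
]
schemes = [urlparse(p).scheme for p in paths]
print(schemes)  # ['hdfs', 's3n', 'tachyon', 'file']
```

The same application code (e.g. `sc.textFile(path)`) runs unchanged against any of these backends; only the path string differs.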
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
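From the application's point of view, these deployment modes differ mainly in the `--master` URL passed to spark-submit. Illustrative invocations for the Spark 1.x era (host names, ports, and `app.py` are placeholders; the YARN mode assumes `HADOOP_CONF_DIR` points at the cluster configuration):

```shell
# Local mode: one JVM using all cores, no cluster manager at all
spark-submit --master "local[*]" app.py

# Standalone cluster manager shipped with Spark
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN (Spark 1.x syntax)
spark-submit --master yarn-cluster app.py
```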
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

            Hadoop ecosystem   Spark ecosystem
Component:  HDFS               Tachyon
            YARN               Mesos
Tools:      Pig                Spark Native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making code much more concise – nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
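The conciseness point is easiest to see on the classic word count. As a hedged sketch (no Spark installation assumed), here is the flatMap → map → reduceByKey chain that Spark's native API expresses in roughly one line per step, written with plain Python builtins; the RDD operator each step corresponds to is noted in the comments:

```python
from functools import reduce

lines = ["to be or not to be", "to be"]

# rdd.flatMap(lambda line: line.split())
words = [w for line in lines for w in line.split()]

# .map(lambda w: (w, 1))
pairs = [(w, 1) for w in words]

# .reduceByKey(lambda a, b: a + b)
counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
# counts now maps each word to its frequency
```

In Spark the same chain runs distributed over partitions of an RDD; the point here is only the functional, lambda-based style of the API.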
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
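The "mix and match" idea can be illustrated without a Spark cluster: run a declarative SQL step, then post-process the result imperatively in the same program. A minimal sketch using Python's built-in sqlite3 as a stand-in SQL engine (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 3), ("bob", 7), ("ann", 5)],
)

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user"
).fetchall()

# Imperative step: arbitrary post-processing on the result set
heavy_users = sorted(user for user, total in rows if total > 5)
```

In Spark SQL the two steps share one engine and one distributed dataset; here they only share a Python process, which is the part being illustrated.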
93
Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                    Storm                               Spark Streaming
Processing model                            Record at a time                    Mini batches
Latency                                     Sub-second                          Few seconds
Fault tolerance (every record processed)    At least once (may be duplicates)   Exactly once
Batch framework integration                 Not available                       Core Spark API
Supported languages                         Any programming language            Scala, Java, Python
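The first two rows of the table are two sides of the same design choice: Storm hands every record to user code as it arrives, while Spark Streaming first groups records into small batches, which adds seconds of latency but gives it batch semantics and the Core Spark API for free. A toy sketch of the batching half (batch size chosen arbitrarily for illustration):

```python
def mini_batches(stream, batch_size):
    """Group a record stream into small batches, as Spark Streaming's
    DStream model does (each batch is then processed like a small RDD)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # one micro-batch worth of records
            batch = []
    if batch:
        yield batch              # flush the final partial batch

batches = list(mini_batches(range(5), batch_size=2))
```

A record-at-a-time system would instead invoke the handler once per record, with no buffering step in between.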
96
GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives

           Hadoop Ecosystem   Spark Ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding to write frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1. File System
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
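In practice, the deployment modes above are selected by the master URL passed to spark-submit or SparkContext, not by different code. A small pure-Python classifier sketching how those documented master-URL formats map to cluster managers (the URL formats are Spark's; the helper function itself is ours, for illustration):

```python
# Spark 1.x master URLs, as documented:
#   local[4]            run locally with 4 worker threads
#   spark://host:7077   Spark standalone cluster
#   mesos://host:5050   Apache Mesos cluster
#   yarn-client         Hadoop YARN, client mode
def cluster_manager(master):
    """Classify a Spark master URL by the cluster manager it selects."""
    if master.startswith("local"):
        return "local mode"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("yarn"):
        return "YARN"
    return "unknown"

print(cluster_manager("local[4]"))            # local mode
print(cluster_manager("mesos://master:5050")) # Mesos
```

The point of the slide is exactly this: swapping `spark://…` for `mesos://…` (or dropping YARN entirely) is a one-line configuration change.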
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
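The conciseness claim is easiest to see in the classic word count. Since Spark is not assumed to be installed here, the RDD-style flatMap → map → reduceByKey chain is mimicked below with plain Python so it runs anywhere (the pipeline shape follows the Spark API; the list standing in for an RDD is our own toy data):

```python
from collections import Counter
from itertools import chain

# Word count in the shape of Spark's flatMap -> map -> reduceByKey
# pipeline, simulated over a plain Python list instead of an RDD.
lines = ["spark or hadoop", "spark with hadoop", "spark without hadoop"]

words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = ((w, 1) for w in words)                               # map
counts = Counter()                                            # reduceByKey
for word, n in pairs:
    counts[word] += n

print(counts["spark"])   # 3
print(counts["hadoop"])  # 3
```

In actual PySpark the whole pipeline is the familiar three-call chain `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`.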
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
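The "mix and match SQL with programmatic APIs" idea can be sketched without Spark by using the standard library's sqlite3: a declarative query does the aggregation, then ordinary code post-processes the rows. This only illustrates the programming model; Spark SQL's distributed engine and its DataFrame API are, of course, quite different.

```python
import sqlite3

# Declarative step: SQL over structured rows (a stand-in for Spark SQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("ann", 3), ("bob", 7), ("ann", 2)])

rows = db.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary Python over the query result.
report = {user: total for user, total in rows}
print(report)   # {'ann': 5, 'bob': 7}
```

In Spark SQL the same interleaving is a `sqlContext.sql(...)` call whose result is an RDD/DataFrame you keep transforming in Scala, Java, or Python.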
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
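The first row of the table — record-at-a-time versus mini batches — is the core difference, and can be simulated in a few lines of plain Python. The timestamps and the one-second batch interval below are invented for the illustration; Spark Streaming's actual batching is done by the DStream runtime.

```python
# Storm-style: each record is handled the moment it arrives.
# Spark Streaming-style: records are grouped into fixed-interval
# mini batches and each batch is processed as a small Spark job.
records = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.3, "e")]

def to_mini_batches(stream, interval=1.0):
    """Bucket (timestamp, value) records into interval-sized batches."""
    batches = {}
    for ts, value in stream:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

print(to_mini_batches(records))   # [['a', 'b'], ['c', 'd'], ['e']]
```

Batching is why Spark Streaming's latency is "a few seconds" rather than sub-second, and also why it can reuse the core Spark batch API unchanged.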
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
66
4. Complementarity +
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity + References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
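The two rules of thumb above can be sketched as a tiny engine selector. This is an illustrative heuristic only: the function name, the 0.5 headroom factor, and the thresholds are invented for the example and do not come from either project.

```python
# Illustrative heuristic only: pick an execution engine from the
# "Data >> RAM" / "Data << RAM" rules of thumb above.
def pick_engine(data_gb, cluster_ram_gb, headroom=0.5):
    """Return 'spark' when the dataset fits in the usable share of
    cluster RAM (so in-memory caching pays off), 'tez' otherwise."""
    usable_ram_gb = cluster_ram_gb * headroom  # leave room for shuffle, OS, ...
    return "spark" if data_gb < usable_ram_gb else "tez"

print(pick_engine(100, 1000))   # data << RAM -> spark
print(pick_engine(5000, 1000))  # data >> RAM -> tez
```

In practice the decision also depends on workload shape (iterative vs. one-pass), not just data volume.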
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
74
1. File System
Spark does not require HDFS (the Hadoop Distributed File System); your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
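Because the storage backend is selected by the URI scheme passed to calls like `sc.textFile(...)`, "bring your own storage" mostly comes down to choosing a path prefix. Here is a minimal sketch of that routing in plain Python; the scheme table is illustrative and not exhaustive, and `storage_backend` is an invented helper, not a Spark API.

```python
from urllib.parse import urlparse

# Illustrative scheme table: storage systems Spark can address without HDFS.
SCHEMES = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon (in-memory)",
    "maprfs": "MapR-FS",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def storage_backend(path):
    """Classify a data path by URI scheme, the way Spark routes its I/O."""
    scheme = urlparse(path).scheme or "file"  # bare paths are local files
    return SCHEMES.get(scheme, "unknown")

print(storage_backend("s3n://bucket/logs/part-00000"))  # Amazon S3
print(storage_backend("/data/local.txt"))               # local file system
```

The application code stays the same whichever backend the prefix points at, which is the practical meaning of "file-system agnostic."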
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
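What changes between these deployments is essentially the `--master` URL handed to `spark-submit`; the cluster manager behind it is otherwise transparent to the application. A sketch, where host names and ports are placeholders and `yarn-client` reflects the Spark 1.x syntax current at the time of this talk:

```python
# Master URLs for common deployment modes (hosts/ports are placeholders).
MASTER_URLS = {
    "local":      "local[4]",           # 4 worker threads on one machine
    "standalone": "spark://host:7077",  # Spark standalone cluster
    "mesos":      "mesos://host:5050",  # Apache Mesos
    "yarn":       "yarn-client",        # Hadoop YARN, Spark 1.x style
}

def submit_command(mode, app="app.py"):
    """Build the spark-submit line for a given deployment mode."""
    return "spark-submit --master {} {}".format(MASTER_URLS[mode], app)

print(submit_command("mesos"))  # spark-submit --master mesos://host:5050 app.py
```

Switching cluster managers is a one-string change; the application code itself is untouched.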
78
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution.
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
87
4. Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
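To show the API style these bullet points describe, here is the classic word count. The PySpark form appears in the comment; the plain-Python stand-in below mirrors it step for step so the example runs without a Spark installation.

```python
lines = ["spark or hadoop", "spark and hadoop"]

# PySpark equivalent (requires a SparkContext `sc`):
#   sc.parallelize(lines).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = {}
for word, one in pairs:                               # reduceByKey
    counts[word] = counts.get(word, 0) + one

print(counts["spark"], counts["hadoop"])  # 2 2
```

The lambda-heavy, pipeline-shaped style is what makes the Scala, Java 8, and Python APIs read almost identically.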
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
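The "mix and match SQL and imperative APIs" point is the key design idea. Since PySpark may not be installed where this is read, the sketch below uses Python's built-in sqlite3 purely as a stand-in for the pattern Spark SQL enables over RDDs/DataFrames: a declarative SQL step followed by ordinary code over its results. The table and values are invented.

```python
import sqlite3
import statistics

# Stand-in dataset (invented values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 120.0), ("a", 80.0), ("b", 300.0)])

# Declarative step: SQL handles filtering and projection...
rows = conn.execute(
    "SELECT latency_ms FROM events WHERE user = 'a'").fetchall()

# ...imperative step: plain code takes over for the follow-on analysis,
# the same mix-and-match that Spark SQL offers within one program.
latencies = [r[0] for r in rows]
print(statistics.mean(latencies))  # 100.0
```

In Spark SQL the handoff is tighter still: the SQL result is a distributed dataset you can keep transforming with the native API.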
93
Spark MLlib
• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
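The "mini-batches" row is the crux of the comparison: Spark Streaming discretizes a stream into short time windows and runs ordinary batch code on each one, which buys batch-API reuse at the cost of latency no lower than the batch interval. A toy simulation, where the timestamps and the 2-second interval are invented:

```python
# Records are (timestamp_seconds, value); the 2-second interval is invented.
stream = [(0.2, 1), (0.9, 2), (1.7, 3), (2.1, 4), (3.5, 5)]
BATCH_INTERVAL = 2.0

def to_micro_batches(records, interval):
    """Group records into time windows, the way Spark Streaming
    discretizes a stream into a sequence of small batches."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# The same batch function (here: sum) runs on every micro-batch,
# which is why Spark Streaming plugs straight into the core Spark API.
batch_sums = [sum(b) for b in to_micro_batches(stream, BATCH_INTERVAL)]
print(batch_sums)  # [6, 9]
```

Storm, by contrast, would hand each of the five records to the topology the moment it arrived, giving sub-second latency but no shared batch API.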
96
GraphX
• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
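The Data << RAM point above can be made concrete with a toy model (pure Python, no Spark involved): a data source that counts how many times the underlying storage is scanned, with and without caching the parsed data across analytics passes. `Source` and `run_passes` are made-up names for illustration only.

```python
# Toy model (not Spark) of why caching helps when data fits in memory:
# re-reading the source on every pass vs. caching it once.

class Source:
    """Wraps a record set and counts how many times storage is scanned."""
    def __init__(self, records):
        self.records = records
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self.records)

def run_passes(source, n_passes, cache=False):
    """Run n analytics passes; with cache=True the source is scanned once."""
    cached = source.read() if cache else None
    total = 0
    for _ in range(n_passes):
        data = cached if cache else source.read()
        total += sum(data)  # stand-in for the per-pass computation
    return total

src = Source([1, 2, 3])
assert run_passes(src, 3) == 18
assert src.scans == 3          # no cache: one scan per pass

src2 = Source([1, 2, 3])
assert run_passes(src2, 3, cache=True) == 18
assert src2.scans == 1         # cached: a single scan feeds every pass
```

When the working set exceeds memory, the cache cannot hold, which is exactly the Data >> RAM regime where a more stream-oriented engine can win.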
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: compute models are still evolving. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
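The storage agnosticism in the list above comes down to dispatching on the URI scheme of a path: the engine stays the same while the scheme picks the backend. A minimal sketch follows; this is not Spark's actual resolution code (which goes through the Hadoop FileSystem API), and the `BACKENDS` registry is a hypothetical stand-in.

```python
# Illustrative sketch of file-system agnosticism: the path's URI scheme,
# not the processing engine, decides which storage backend is used.
from urllib.parse import urlparse

# Hypothetical scheme-to-backend registry, invented for this example.
BACKENDS = {
    "hdfs":    "Hadoop Distributed File System",
    "s3n":     "Amazon S3",
    "maprfs":  "MapR-FS",
    "swift":   "OpenStack Swift",
    "tachyon": "Tachyon",
    "file":    "local file system",
}

def resolve_backend(path):
    """Pick a storage backend from the URI scheme; default to local files."""
    scheme = urlparse(path).scheme or "file"
    if scheme not in BACKENDS:
        raise ValueError("no backend registered for scheme: " + scheme)
    return BACKENDS[scheme]

assert resolve_backend("hdfs://namenode:8020/logs") == "Hadoop Distributed File System"
assert resolve_backend("s3n://bucket/events.json") == "Amazon S3"
assert resolve_backend("/tmp/local.txt") == "local file system"
```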
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System, Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
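In practice, the deployment choices above are selected in Spark through the master URL passed at submission time. The URL formats below follow Spark's documented conventions of that era (`local[*]`, `spark://`, `mesos://`, `yarn-client`/`yarn-cluster`); the parser itself is a simplified stand-in, not Spark's implementation.

```python
# Sketch of how a Spark-style master URL names the cluster manager.

def cluster_manager(master):
    """Map a master URL to the cluster manager that will run the job."""
    if master.startswith("local"):        # local, local[4], local[*]
        return "local threads (no cluster)"
    if master.startswith("spark://"):     # standalone master host:port
        return "Spark standalone"
    if master.startswith("mesos://"):     # Mesos master or ZooKeeper URL
        return "Apache Mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "Hadoop YARN"
    raise ValueError("unrecognized master URL: " + master)

assert cluster_manager("local[*]") == "local threads (no cluster)"
assert cluster_manager("spark://host:7077") == "Spark standalone"
assert cluster_manager("mesos://zk://host:2181/mesos") == "Apache Mesos"
assert cluster_manager("yarn-cluster") == "Hadoop YARN"
```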
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as data center "OS": share the data center between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
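Mesos's fine-grained sharing rests on resource offers: idle CPU and memory on each node are offered to a framework, which accepts whatever its tasks need at that moment. A heavily simplified, hypothetical sketch of that accept step (real Mesos negotiates offers over an API, with many more attributes):

```python
# Toy model of Mesos-style resource offers: each offer is
# (node_id, idle_cpus, idle_mem_gb), and a framework greedily
# accepts offers big enough for one task.

def accept_offers(offers, task_cpus, task_mem):
    """Place one task per offer that fits; return the chosen node ids."""
    placed = []
    for node, cpus, mem in offers:
        if cpus >= task_cpus and mem >= task_mem:
            placed.append(node)  # accept this offer's idle resources
    return placed

# A task needing 2 CPUs / 4 GB can use the idle capacity of n1 and n3,
# while the nearly-full n2 is simply declined.
offers = [("n1", 4, 8), ("n2", 1, 2), ("n3", 8, 32)]
assert accept_offers(offers, 2, 4) == ["n1", "n3"]
```

This is why a long-running Spark job on Mesos can grow and shrink with the cluster's idle capacity instead of holding a fixed allocation.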
90
YARN vs Mesos

Criteria          YARN                                    Mesos
Resource sharing  Yes                                     Yes
Written in        Java                                    C++
Scheduling        Memory only                             CPU and memory
Running tasks     Unix processes                          Linux container groups
Requests          Specific requests, locality preference  More generic, but more coding to write frameworks
Maturity          Less mature                             Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
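To show the flavor of the native API, here is a miniature in-memory imitation of the RDD fluent style (map / flatMap / reduceByKey) running the classic word count. This is not Spark: `TinyRDD` is an invented class with no partitioning, laziness, or fault tolerance, only the chaining style.

```python
# A miniature, in-memory imitation of the RDD programming model.

class TinyRDD:
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # One input item can yield many output items (e.g. line -> words).
        return TinyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return TinyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge the values of each key with the given combine function.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return TinyRDD(acc.items())

    def collect(self):
        return self.data

lines = TinyRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
assert dict(counts) == {"spark": 2, "or": 1, "hadoop": 2, "and": 1}
```

In real Spark the same chain runs lazily over partitioned data on a cluster; the shape of the code is what carries over between Scala, Java, and Python.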
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
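The "mix SQL and imperative APIs" idea can be illustrated without a cluster, using the standard library's sqlite3 in place of Spark SQL. The flow has the same shape: register structured rows, query them declaratively in SQL, then post-process the result in ordinary code. The `top_pages` helper and its log schema are invented for this example.

```python
# Mixing SQL with imperative post-processing, sketched with sqlite3
# (stand-in for Spark SQL so the example runs anywhere).
import sqlite3

def top_pages(rows, min_hits):
    """SQL aggregation over (page, hits) rows, then imperative formatting."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE logs (page TEXT, hits INTEGER)")
    con.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT page, SUM(hits) AS total FROM logs "
        "GROUP BY page HAVING SUM(hits) >= ? ORDER BY total DESC",
        (min_hits,))
    # Imperative step on the declarative result:
    return [f"{page}:{total}" for page, total in cur]

rows = [("/home", 3), ("/docs", 5), ("/home", 4)]
assert top_pages(rows, 5) == ["/home:7", "/docs:5"]
```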
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                              Spark Streaming
Processing model             Record at a time                   Mini-batches
Latency                      Sub-second                         Few seconds
Fault tolerance              At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration  Not available                      Core Spark API
Supported languages          Any programming language           Scala, Java, Python
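The first row of the table, record-at-a-time versus mini-batches, can be sketched in a few lines of plain Python (timings and distribution abstracted away): per-record handling gives the lowest latency, while batching trades latency for per-invocation throughput. Both function names are invented for the illustration.

```python
# Toy contrast of the two stream-processing models.

def record_at_a_time(stream, handle):
    """Storm-style: invoke the handler once per record."""
    for rec in stream:
        handle([rec])

def mini_batches(stream, handle, batch_size):
    """Spark Streaming-style: group records and invoke per batch."""
    batch = []
    for rec in stream:
        batch.append(rec)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:                 # flush the final partial batch
        handle(batch)

calls = []
record_at_a_time([1, 2, 3, 4, 5], calls.append)
assert calls == [[1], [2], [3], [4], [5]]   # five invocations, minimal latency

calls = []
mini_batches([1, 2, 3, 4, 5], calls.append, batch_size=2)
assert calls == [[1, 2], [3, 4], [5]]       # three invocations, batched work
```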
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: compute models are still evolving. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System: your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
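One way to see this storage agnosticism in practice: the same Spark application can be pointed at any of these backends just by changing the input URI scheme. A sketch (app.py, the bucket, paths, and host names are placeholders; the scheme names are the ones commonly used with Spark 1.x):

```shell
# Same application, different storage backends: only the URI scheme changes
spark-submit app.py s3n://my-bucket/logs/         # Amazon S3
spark-submit app.py maprfs:///data/logs/          # MapR-FS
spark-submit app.py swift://logs.sahara/2015/     # OpenStack Swift
spark-submit app.py tachyon://master:19998/logs/  # Tachyon in-memory FS
```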
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
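In practice these deployment modes differ mainly in the --master URL handed to spark-submit. A sketch of the common forms (host names, ports, and app.py are placeholders; yarn-cluster was the Spark 1.x spelling):

```shell
# Local mode: run Spark in-process, one worker thread per core
spark-submit --master local[*] app.py

# Standalone cluster: point at the Spark master's host and port
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos: point at the Mesos master (no Hadoop involved)
spark-submit --master mesos://mesos-host:5050 app.py

# YARN: the only mode that assumes a Hadoop cluster
spark-submit --master yarn-cluster app.py
```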
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open-source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
              Hadoop Ecosystem    Spark Ecosystem
Components    HDFS                Tachyon
              YARN                Mesos
Tools         Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and memory
Running tasks      Unix processes                              Linux container groups
Requests           Specific requests and locality preference   More generic, but more coding to write frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
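The concise lambda style the native API encourages can be sketched without a cluster. The following plain-Python pipeline (an illustration only, no SparkContext involved) mirrors the flatMap → map → reduceByKey shape of the classic RDD word count:

```python
from collections import defaultdict

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap + map: split each line into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey: sum the counts per word
def reduce_by_key(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_by_key(pairs)
print(word_counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Spark itself the same shape would be a one-liner over an RDD; the point here is just how naturally the computation decomposes into lambda-friendly map and reduce steps.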
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
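The "mix and match" idea (run a declarative SQL step, then continue imperatively on the result) can be illustrated without a Spark cluster. Here Python's built-in sqlite3 stands in for the SQL engine; this is a conceptual sketch, not Spark SQL code, and the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 6), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: continue the analysis in ordinary code
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 5), ('bob', 6)]
print(top)   # ('bob', 6)
```

Spark SQL applies the same pattern at cluster scale: the result of a SQL query is an ordinary distributed collection that subsequent functional or imperative code can keep transforming.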
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini-batches
Latency                       Sub-second                 A few seconds
Fault tolerance               At least once              Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
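The processing-model row is the key difference, and it can be sketched in a few lines of plain Python (an illustration, not either framework's API): record-at-a-time hands each record to a handler as it arrives, while mini-batching groups arriving records and processes each group at once.

```python
def record_at_a_time(stream, handle):
    # Storm-style: every record is handled the moment it arrives
    for record in stream:
        handle(record)

def mini_batches(stream, batch_size):
    # Spark Streaming-style: records are grouped into small batches
    # (grouped by count here; Spark Streaming groups by time interval)
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

stream = [1, 2, 3, 4, 5]
seen = []
record_at_a_time(stream, seen.append)
batches = list(mini_batches(stream, batch_size=2))
print(seen)     # [1, 2, 3, 4, 5]
print(batches)  # [[1, 2], [3, 4], [5]]
```

The batching is what buys Spark Streaming its high throughput and exactly-once semantics, at the cost of the few seconds of latency noted in the table.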
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Component         Hadoop Ecosystem   Spark Ecosystem
Storage           HDFS               Tachyon
Resource manager  YARN               Mesos
Tools             Pig                Spark native API
                  Hive               Spark SQL
                  Mahout             MLlib
                  Storm              Spark Streaming
                  Giraph             GraphX
                  HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
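The fine-grained sharing idea can be sketched as a toy model in plain Python (this is not the Mesos API; the framework names and task counts are invented for illustration): CPUs freed by one framework's finished tasks are immediately re-offered to any framework with pending work, instead of sitting idle inside a static partition.

```python
# Toy model of Mesos-style fine-grained sharing (illustration only;
# not the Mesos API). An allocator re-offers freed CPUs every tick
# to whichever framework still has pending 1-tick tasks.

def run_cluster(total_cpus, pending):
    """pending: dict framework -> number of 1-tick tasks.
    Returns ticks needed to drain all tasks when idle CPUs
    are re-offered round-robin every tick."""
    ticks = 0
    while any(pending.values()):
        cpus = total_cpus
        while cpus > 0 and any(pending.values()):
            for fw in pending:
                if cpus == 0:
                    break
                if pending[fw] > 0:
                    pending[fw] -= 1  # one task scheduled this tick
                    cpus -= 1
        ticks += 1
    return ticks

# Spark has a long backlog, another framework a short one; with
# sharing, Spark absorbs the CPUs the other framework stops using.
shared = run_cluster(4, {"spark": 10, "other": 2})

# Static partitioning: each framework is stuck with half the CPUs.
static = max(run_cluster(2, {"spark": 10}), run_cluster(2, {"other": 2}))

print(shared, static)  # prints: 3 5
```

The shared cluster drains in 3 ticks instead of 5, which is the effect the slide describes for long-running Spark jobs next to bursty neighbors.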
90
YARN vs Mesos
Criteria          YARN                                    Mesos
Resource sharing  Yes                                     Yes
Written in        Java                                    C++
Scheduling        Memory only                             CPU and memory
Running tasks     Unix processes                          Linux container groups
Requests          Specific requests, locality preference  More generic, but more coding to write frameworks
Maturity          Less mature                             Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
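To illustrate the conciseness point, here is the classic word count written with plain Python builtins in the same functional shape as the Spark RDD chain; no Spark installation is needed to run it, and the commented RDD version is the standard textbook form of the example.

```python
# The flavor of Spark's functional API, mimicked with Python builtins.
# The real RDD chain would be:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to spark or not to spark"]

words = chain.from_iterable(line.split() for line in lines)  # flatMap
counts = Counter(words)                                      # reduceByKey on (w, 1)

print(counts["to"])  # prints: 4
```

The lambda-based RDD version reads almost identically in Scala, Java 8, and Python, which is the point of the slide's bullet on Java 8 support.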
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides fast SQL performance while maintaining compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data and ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
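The "mix and match" idea can be sketched with Python's built-in sqlite3 module as a tiny stand-in engine (this is not Spark SQL, and the table and values are invented for illustration): a declarative SQL aggregation step followed by ordinary imperative post-processing, in one program.

```python
# Declarative SQL + imperative code in one program, using sqlite3
# as a stand-in engine (NOT Spark SQL; data is made up).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL.
rows = con.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result with ordinary code.
top = [user for user, total in rows if total > 4]
print(top)  # prints: ['ann', 'bob']
```

In Spark SQL the same pattern applies, except the SQL result is a distributed dataset that flows straight into the RDD/DataFrame APIs instead of a local list.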
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                              Spark Streaming
Processing model             Record at a time                   Mini batches
Latency                      Sub-second                         Few seconds
Fault tolerance              At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration  Not available                      Core Spark API
Supported languages          Any programming language           Scala, Java, Python
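The "record at a time" vs "mini batches" row above can be sketched in plain Python (a conceptual toy, not Storm or Spark Streaming code; the batch interval and arrival times are made-up illustration values): in a mini-batch engine, a record waits until its batch closes, which is where the few-seconds latency comes from.

```python
# Record-at-a-time vs mini-batch processing, as a conceptual toy
# (not Storm or Spark Streaming code; values are made up).

BATCH_INTERVAL = 2  # seconds per mini batch

arrivals = [0, 1, 2, 3, 4, 5]  # arrival time of each record

# Record-at-a-time: each record is processed as it arrives.
record_latency = [0 for _ in arrivals]

# Mini batches: a record waits until its batch boundary closes.
def batch_close(t):
    return ((t // BATCH_INTERVAL) + 1) * BATCH_INTERVAL

batch_latency = [batch_close(t) - t for t in arrivals]

print(max(record_latency), max(batch_latency))  # prints: 0 2
```

The worst-case added latency equals the batch interval, which is why Spark Streaming's latency is quoted in seconds while Storm's is sub-second.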
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
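In practice, the choice of cluster manager surfaces as a single flag at submission time; the application itself is unchanged. A sketch of spark-submit invocations (host names, ports, and the app.py file are placeholders, not from the talk):

```shell
# Same application, different cluster managers (hosts/ports are placeholders).
spark-submit --master local[4]          app.py   # single machine, 4 cores
spark-submit --master spark://host:7077 app.py   # Spark standalone
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn              app.py   # Hadoop YARN
```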
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
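To illustrate the functional chaining style of the native API, here is a plain-Python analogue of the classic word count; this is not PySpark, it just mirrors the flatMap / map / reduceByKey shape:

```python
from collections import Counter
from functools import reduce

# Plain-Python analogue of Spark's word count, mirroring the RDD operators.
lines = ["to be or not to be", "to thrive"]
words = [w for line in lines for w in line.split()]            # ~ flatMap
pairs = [(w, 1) for w in words]                                # ~ map
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                pairs, Counter())                              # ~ reduceByKey

# The real PySpark pipeline would read roughly:
#   sc.textFile(path).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```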
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
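The "mix and match" pattern, declarative SQL for the relational step and imperative code for the rest, can be sketched with sqlite3 standing in for Spark SQL (the table name and data are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for Spark SQL here; the point is the pattern of
# combining a declarative SQL step with imperative post-processing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 7.5)])

# Declarative step: aggregate per user in SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user").fetchall()

# Imperative step: arbitrary program logic over the SQL result.
top = max(rows, key=lambda r: r[1])
```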
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
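The processing-model row is the key architectural difference between the two systems; a small pure-Python sketch of the two models (not Storm or Spark code):

```python
# Record-at-a-time: each event is handled as soon as it arrives.
def record_at_a_time(stream, handle):
    for record in stream:
        handle(record)  # per-record latency: typically sub-second

# Mini-batching: events are grouped into small batches before processing.
def mini_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # per-batch latency: typically a few seconds
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(mini_batches(range(5), batch_size=2))  # [[0, 1], [2, 3], [4]]
```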
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
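"Hadoop-compatible" above means applications address storage by URI, so a store like Tachyon can be swapped in by changing only the URI scheme. A minimal plain-Python sketch of that idea (the backends and their contents are hypothetical, not Tachyon's API):

```python
from urllib.parse import urlparse

# Hypothetical in-memory "filesystems" keyed by URI scheme, to illustrate
# scheme-based dispatch; real HDFS/Tachyon clients are far richer.
BACKENDS = {
    "hdfs": {"/data/input.txt": "stored on disk via HDFS"},
    "tachyon": {"/data/input.txt": "stored in memory via Tachyon"},
}

def read(uri: str) -> str:
    """Dispatch a read to whichever backend the URI scheme names."""
    parsed = urlparse(uri)
    return BACKENDS[parsed.scheme][parsed.path]

# The application logic is identical; only the scheme changes.
print(read("hdfs:///data/input.txt"))
print(read("tachyon:///data/input.txt"))
```

This is why "without any code change" holds: the job's logic never mentions the concrete storage system, only the URI it was given.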
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
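The fine-grained sharing described above rests on Mesos-style two-level scheduling: the master offers idle resources, and each framework decides what to accept. A toy simulation of that loop (node names and numbers are made up; this is the shape of the protocol, not Mesos code):

```python
# Toy sketch of Mesos-style resource offers: the master advertises idle
# CPUs, and a framework (e.g. a Spark job) accepts only what it needs.

def make_offers(cluster):
    """Offer whatever CPUs are currently idle on each node."""
    return {node: free for node, free in cluster.items() if free > 0}

def spark_framework(offers, cpus_wanted):
    """Accept as much of each offer as needed, then stop."""
    accepted = {}
    for node, free in offers.items():
        if cpus_wanted <= 0:
            break
        take = min(free, cpus_wanted)
        accepted[node] = take
        cpus_wanted -= take
    return accepted

cluster = {"node1": 2, "node2": 0, "node3": 4}  # idle CPUs per node
offers = make_offers(cluster)                   # node2 has nothing to offer
accepted = spark_framework(offers, cpus_wanted=5)
print(accepted)
```

Because idle capacity is re-offered continuously, a long-running Spark job can soak up resources other frameworks are not using at that moment, which is where the performance gains come from.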
90
YARN vs Mesos
Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
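The "Scheduling" row is the most consequential difference in the table: a scheduler that checks memory alone (as the table describes YARN at the time) can place a task that a CPU-and-memory scheduler would reject. A tiny illustration with made-up numbers:

```python
# Toy placement checks; the node/task shapes are illustrative only.

def fits_memory_only(node, task):
    # Memory-only admission, as the table attributes to YARN.
    return task["mem"] <= node["mem"]

def fits_cpu_and_memory(node, task):
    # Multi-dimensional admission, as the table attributes to Mesos.
    return task["mem"] <= node["mem"] and task["cpu"] <= node["cpu"]

node = {"mem": 8, "cpu": 1}  # 8 GB and 1 CPU free
task = {"mem": 4, "cpu": 2}  # wants 4 GB and 2 CPUs

print(fits_memory_only(node, task))     # passes: memory is the only check
print(fits_cpu_and_memory(node, task))  # fails: the CPU dimension is short
```

The memory-only scheduler would oversubscribe the node's CPUs here, which is exactly the failure mode multi-resource scheduling avoids.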
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
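The conciseness claim above is easiest to see on the classic word count. This is plain Python (not Spark code, which would need a running SparkContext), with each step commented with the RDD operation it mirrors:

```python
from collections import Counter

# The shape of Spark's word count, step by step, in plain Python.
lines = ["spark or hadoop", "spark and hadoop"]

words = [w for line in lines for w in line.split()]  # flatMap(line => line.split(" "))
pairs = [(w, 1) for w in words]                      # map(w => (w, 1))
counts = Counter()                                   # reduceByKey(_ + _)
for w, n in pairs:
    counts[w] += n

print(dict(counts))
```

In Spark itself the same pipeline is three chained calls on an RDD, and the Scala, Java 8 lambda, and Python versions all read essentially like the comments above.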
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
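Ingesting self-describing sources like JSON works because a schema can be inferred from the records themselves. A toy sketch of that idea in plain Python (real Spark SQL does far more: nested types, type widening, sampling):

```python
import json

# Union the field names seen across JSON records and note each field's type,
# a simplified version of schema inference over semi-structured data.
records = ['{"name": "Ada", "age": 36}', '{"name": "Alan", "city": "London"}']

schema = {}
for raw in records:
    for field, value in json.loads(raw).items():
        schema.setdefault(field, type(value).__name__)

print(schema)
```

Once a schema exists, the same records can be queried with SQL or manipulated programmatically, which is the "mix and match" the slide describes.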
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
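The "Processing model" row above can be sketched in a few lines of plain Python: record-at-a-time handles each event as it arrives, while a micro-batch model (Spark Streaming's approach) groups events into small time windows and processes each window as one batch. The events and window size are invented for illustration:

```python
# (timestamp_seconds, value) pairs standing in for a live event stream.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.3, "e")]

# Record-at-a-time (Storm-style): one processing call per event.
processed_one_by_one = [value.upper() for _, value in events]

# Micro-batches (Spark Streaming-style): bucket into 1-second windows,
# then process each window's records together.
batches = {}
for timestamp, value in events:
    batches.setdefault(int(timestamp), []).append(value)
processed_in_batches = {window: [v.upper() for v in vs] for window, vs in batches.items()}

print(processed_one_by_one)
print(processed_in_batches)
```

Batching is also why the latency rows differ: no record in a window can be emitted before its window closes, but each batch can reuse the core Spark execution machinery.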
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. See "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
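Because Tachyon implements the Hadoop FileSystem interface, "without any code change" concretely means only the input path's URI scheme changes while the program logic stays the same. A minimal sketch (hostnames and ports are made up):

```python
# Hypothetical cluster addresses; the analysis code is identical either way,
# only the URI scheme selects the storage backend.
hdfs_path    = "hdfs://namenode:9000/logs/2015/03/12"
tachyon_path = "tachyon://master:19998/logs/2015/03/12"

def scheme(path):
    # Spark resolves the backing file system from the scheme prefix
    return path.split("://", 1)[0]

print(scheme(hdfs_path), scheme(tachyon_path))  # hdfs tachyon
```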
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making code much more concise – nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
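As a rough illustration of the functional style this API encourages, here is the classic word count written in plain Python, mirroring the flatMap / map / reduceByKey chain a PySpark program would apply to an RDD (no Spark installation is needed for this sketch):

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: one record per word
words = [w for line in lines for w in line.split()]

# map: emit (word, 1) pairs, written as a lambda to echo the Spark style
pairs = list(map(lambda w: (w, 1), words))

# reduceByKey: fold the pairs into a per-word total
def reduce_by_key(acc, kv):
    key, value = kv
    acc[key] = acc.get(key, 0) + value
    return acc

counts = reduce(reduce_by_key, pairs, {})
print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Java 8, the pairing step becomes a one-line lambda rather than an anonymous inner class, which is the conciseness the slide refers to.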
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
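The "mix and match" point – a declarative SQL step followed by imperative post-processing – can be sketched with Python's standard-library sqlite3, used here purely as a local stand-in for Spark SQL's distributed engine (table and data are made up):

```python
import sqlite3

# In-memory table standing in for a schema-bearing source (JSON, Parquet, Hive...)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ana", 3), ("bob", 7), ("ana", 5)])

# Declarative step: SQL aggregates clicks per user
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user"
).fetchall()

# Imperative step: arbitrary Python logic over the query result
heavy_users = sorted(user for user, total in rows if total > 5)
print(heavy_users)  # ['ana', 'bob']
```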
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
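The "mini batches" row is the core difference: Spark Streaming buffers incoming records into fixed time windows and runs each window through the Spark engine, which is why its latency is a few seconds rather than sub-second. A toy sketch of that batching step (the stream and the 2-second interval are made up):

```python
# Hypothetical stream: (arrival_time_in_seconds, payload) records
stream = [(0.2, "a"), (0.9, "b"), (2.1, "c"), (3.5, "d"), (4.0, "e")]

BATCH_INTERVAL = 2.0  # seconds of records collected per mini batch

# Assign each record to its time window: [0,2) -> 0, [2,4) -> 1, ...
batches = {}
for t, payload in stream:
    window = int(t // BATCH_INTERVAL)
    batches.setdefault(window, []).append(payload)

# Spark Streaming would then run each window as one small Spark job
for window in sorted(batches):
    print(f"batch {window}: {batches[window]}")
# batch 0: ['a', 'b']
# batch 1: ['c', 'd']
# batch 2: ['e']
```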
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic – bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic – choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi