Spark or Hadoop: Is it an either-or proposition? OR? XOR? By Slim Baltagi (@SlimBaltagi), Big Data Practice Director, Advanced Analytics LLC. Los Angeles Spark Users Group, March 12, 2015.


Your Presenter – Slim Baltagi

• Sr. Big Data Solutions Architect living in Chicago
• Over 17 years of IT and business experience
• Over 4 years of Big Data experience, working on over 12 Hadoop projects
• Speaker at Big Data events
• Creator and maintainer of the Apache Spark Knowledge Base http://www.SparkBigData.com with over 4,000 categorized Apache Spark web resources

@SlimBaltagi
https://www.linkedin.com/in/slimbaltagi
sbaltagi@gmail.com

Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing nor promoting any product or vendor mentioned in this talk.

Agenda

I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More Q&A

I. Motivation

1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways

1. News

• Is it Spark vs., OR, and Hadoop?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage to Hadoop Will Be Bigger Than Kim and Kanye
• Adios Hadoop, Hola Spark!
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop
• Escape from Hadoop?
• Spark promises to up-end Hadoop, but in a good way

2. Surveys

• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015 http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

Apache Spark Survey 2015 by Typesafe – Quick Snapshot

[Figure: snapshot of the Typesafe Apache Spark survey results]

3. Vendors

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

3. Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

4. Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it."
Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

1. Big Data

• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

2. Typical Big Data Stack

[Figure: typical Big Data stack diagram]

3. Apache Hadoop

• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

4. Apache Spark

• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

5. Key Takeaways

1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

1. Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: a Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

[Figure: compute models by generation]
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real-Time
• 4th generation: Batch, Interactive, Real-Time, Iterative

1. Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.

1. Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1. Evolution

• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
• Processing model: On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need for abstractions; interactive mode
• Compatibility: to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
• Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy (partial support)

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
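The translation described in the Cloudera how-to can be seen in miniature with word count: the mapper's emit and the reducer's sum collapse into a short chain of RDD transformations. A minimal sketch (the input/output paths are illustrative, not from the slides):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Word count, the "hello world" of MapReduce, expressed as Spark transformations.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile("hdfs:///tmp/input")          // read splits, like an InputFormat
      .flatMap(_.split("\\s+"))               // map phase: tokenize
      .map(word => (word, 1))                 // emit (key, 1) pairs
      .reduceByKey(_ + _)                     // reduce phase: sum per key
      .saveAsTextFile("hdfs:///tmp/output")   // write results, like an OutputFormat
    sc.stop()
  }
}
```

Each MapReduce phase maps onto one transformation, which is why single-job translations are usually mechanical.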

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/12

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

Cascading (expected in the 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye, MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

3. Integration

[Figure: mapping of services to open source tools — Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
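In the spirit of the HBaseTest.scala example referenced above, reading an HBase table through newAPIHadoopRDD takes only a few lines. A sketch (the table name is illustrative):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))

// Point the standard HBase TableInputFormat at a table...
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// ...and let Spark consume it like any other Hadoop InputFormat.
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(s"rows: ${hbaseRDD.count()}")
```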

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
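With the DataStax connector above, a Cassandra table becomes an RDD, and an RDD of tuples can be written back. A sketch (keyspace, table, and column names are illustrative):

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as a Spark RDD
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Write an RDD of tuples back to a Cassandra table
sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
```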

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: Introduction & Setup https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: Hive Example http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: Spark Example & Key Takeaways http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC) at https://issues.apache.org/jira/issues/
• Some issues are critical ones.
• Running Spark on YARN https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
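The Hive bullets above, in code: with a HiveContext (the Spark 1.2-era API), Hive tables can be queried and written directly. A sketch, assuming a Hive table named src already exists:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

// Import relational data from a Hive table and run SQL over it
val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
rows.collect().foreach(println)

// Easily write results back out to a Hive table
hiveContext.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")
```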

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

3. Integration

• Apache Kafka is a high-throughput distributed messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
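The receiver-based approach from the integration guide boils down to a few lines. A sketch (the ZooKeeper address, consumer group, and topic name are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

// Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> #threads)
val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))

// Count words per batch from the message values
messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```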

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
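Schema inference as described above takes only a couple of lines. A sketch (the file path and field names are illustrative, Spark 1.2-era API):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Point Spark SQL at JSON; the schema is inferred automatically, no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Query it like any other table
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()
```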

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
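The three Parquet bullets above look like this in code (paths are illustrative, Spark 1.x API):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Import relational data from Parquet files
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")

// Run SQL queries over the imported data
events.registerTempTable("events")
val recent = sqlContext.sql("SELECT * FROM events WHERE year = 2015")

// Easily write results back out to Parquet
recent.saveAsParquetFile("hdfs:///data/events_2015.parquet")
```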

3. Integration

• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Figure: Hadoop ecosystem and Spark ecosystem side by side]

4. Complementarity: Tachyon + Spark

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014 http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: YARN + Mesos

References:
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4 Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70
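The selection logic described above can be sketched in a few lines. The thresholds, engine names, and function signature below are illustrative assumptions, not Datameer's actual rules:

```python
def choose_engine(input_bytes, cluster_ram_bytes, interactive):
    """Toy dispatcher in the spirit of a 'smart execution engine':
    pick a compute framework per job step from the data size and the
    cluster state. Thresholds and names are invented for illustration."""
    if input_bytes < 512 * 1024 * 1024:            # small data: run locally
        return "single-node"
    if input_bytes < cluster_ram_bytes and interactive:
        return "spark"                             # fits in RAM, needs low latency
    return "mapreduce"                             # huge batch: disk-based engine

# Example decisions
print(choose_engine(10 * 2**20, 64 * 2**30, True))    # single-node
print(choose_engine(32 * 2**30, 64 * 2**30, True))    # spark
print(choose_engine(10 * 2**40, 64 * 2**30, False))   # mapreduce
```

A real implementation would also weigh cluster load and data locality, as the slide notes; this sketch only shows the dispatch idea.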

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
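Spark's storage-agnosticism comes down to selecting a backend from the path's URI scheme (hdfs://, s3n://, a bare local path, and so on). A toy dispatcher makes the idea concrete; the scheme-to-backend mapping below is illustrative, not Spark's internal resolution logic:

```python
from urllib.parse import urlparse

BACKENDS = {              # scheme -> storage backend (illustrative mapping)
    "hdfs": "HDFS",
    "s3n": "Amazon S3",
    "file": "local file system",
    "cfs": "Cassandra File System",
    "tachyon": "Tachyon",
}

def storage_backend(path):
    """Return the storage backend a path would be read from."""
    scheme = urlparse(path).scheme or "file"   # bare paths default to local
    return BACKENDS.get(scheme, "unknown")

print(storage_backend("hdfs://namenode:8020/logs"))  # HDFS
print(storage_backend("s3n://bucket/key"))           # Amazon S3
print(storage_backend("/tmp/data.txt"))              # local file system
```

Swapping HDFS for any of the alternatives listed above is, from the application's point of view, just a different scheme in the input path.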

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
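In practice, the deployment choice above is expressed in the master URL handed to Spark. A minimal classifier over the URL shapes documented for Spark 1.x (the helper function itself is hypothetical, not part of any Spark API):

```python
def deployment_mode(master):
    """Classify a Spark-style master URL into a deployment mode.
    URL shapes follow Spark's documented conventions."""
    if master.startswith("local"):
        return "local"            # local / local[N]: threads on one machine
    if master.startswith("spark://"):
        return "standalone"       # Spark's own cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master in ("yarn", "yarn-client", "yarn-cluster"):
        return "yarn"
    raise ValueError("unknown master URL: " + master)

print(deployment_mode("local[4]"))             # local
print(deployment_mode("spark://host:7077"))    # standalone
print(deployment_mode("mesos://zk://a:2181"))  # mesos
```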

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem        Spark ecosystem

Components:
  HDFS                  Tachyon
  YARN                  Mesos

Tools:
  Pig                   Spark native API
  Hive                  Spark SQL
  Mahout                MLlib
  Storm                 Spark Streaming
  Giraph                GraphX
  HUE                   Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88
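Tachyon's core idea - files live in cluster memory, so several frameworks can share them at memory speed - can be reduced to a toy model: a shared map of path to bytes. This sketches the concept only; none of Tachyon's real API, replication, or lineage machinery appears here:

```python
class MemFS:
    """Toy in-memory 'file system': a shared dict of path -> bytes.
    Illustrates the memory-speed-sharing idea behind Tachyon, not its
    actual API or fault-tolerance machinery."""
    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]

fs = MemFS()
# One "framework" writes a result file into cluster memory...
fs.write("/jobs/wordcount/part-0", b"spark hadoop spark")
# ...and a second "framework" reads it back with no disk round trip.
print(fs.read("/jobs/wordcount/part-0").decode())  # spark hadoop spark
```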

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                    Mesos

Resource sharing  Yes                     Yes
Written in        Java                    C++
Scheduling        Memory only             CPU and Memory
Running tasks     Unix processes          Linux Container groups
Requests          Specific requests and   More generic, but more coding
                  locality preference     for writing frameworks
Maturity          Less mature             Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
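The "mix and match SQL and imperative APIs" idea can be illustrated framework-free, with the stdlib sqlite3 module standing in for a SQL engine. This shows the concept only; it is not Spark SQL's API:

```python
import sqlite3

# An in-memory table stands in for a schema-bearing data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...then an imperative step over the result set.
top = [user for user, total in rows if total > 4]

print(rows)  # [('ann', 8), ('bob', 7)]
print(top)   # ['ann', 'bob']
```

In Spark SQL the same pattern applies, with the SQL result flowing into RDD transformations instead of a Python list comprehension.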

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria              Storm                 Spark Streaming

Processing model      Record at a time      Mini batches
Latency               Sub-second            Few seconds
Fault tolerance -     At least once (may    Exactly once
every record          be duplicates)
processed
Batch framework       Not available         Core Spark API
integration
Supported languages   Any programming       Scala, Java, Python
                      language

95
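The first table row - record-at-a-time versus mini batches - can be made concrete with a toy sketch. It is list-driven, with a batch size standing in for Spark Streaming's time-based batch interval:

```python
def record_at_a_time(stream, handle):
    """Storm-style: process each record as it arrives (lowest latency)."""
    return [handle(r) for r in stream]

def micro_batches(stream, batch_size, handle_batch):
    """Spark-Streaming-style: group records into small batches and hand
    each batch to the batch engine (latency of a few seconds)."""
    out, batch = [], []
    for r in stream:
        batch.append(r)
        if len(batch) == batch_size:     # stand-in for a time window closing
            out.append(handle_batch(batch))
            batch = []
    if batch:                            # flush the final partial batch
        out.append(handle_batch(batch))
    return out

stream = [1, 2, 3, 4, 5]
print(record_at_a_time(stream, lambda r: r * 2))  # [2, 4, 6, 8, 10]
print(micro_batches(stream, 2, sum))              # [3, 7, 5]
```

Batching is also what gives Spark Streaming its exactly-once semantics and its reuse of the core batch API, per the table above.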

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring Your Own Storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2 Surveys

• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015: http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm

• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

6

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

7

3 Vendors

8

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html

• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3 Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/

• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

9

3 Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3 Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

11

4 Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?

• Both are already happening:

• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

12

4 Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."

• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."

• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1 News: Big Data is no longer a Hadoop monopoly.

2 Surveys: listen to what Spark developers are saying.

3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4 Analysts: thorough understanding of the market dynamics.

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Data

• Big Data is still one of the most inflated buzzwords of recent years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack.

• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage

• BYOC: Bring Your Own Cluster

• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

19

5 Key Takeaways

1 Big Data: still one of the most inflated buzzwords.

2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4 Apache Spark: emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22
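Why MapReduce feels like "assembly code": even word count forces you through explicit map, shuffle, and reduce phases. A stdlib-only simulation of the three phases (a sketch of the programming model, not Hadoop's actual execution):

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit a (key, value) pair per word."""
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: combine all values seen for one key."""
    yield (word, sum(counts))

def run_mapreduce(lines):
    # Map phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle phase: sort and group values by key
    pairs.sort(key=lambda kv: kv[0])
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(pairs, key=lambda kv: kv[0]))
    # Reduce phase
    return dict(kv for k, vs in grouped for kv in reducer(k, vs))

print(run_mapreduce(["spark or hadoop", "spark and hadoop"]))
# {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

The higher-level tools listed above (Pig, Hive, Scalding, Crunch, …) exist precisely so that users do not have to spell out these phases by hand.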

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

• 1st Generation: Batch

• 2nd Generation: Batch, Interactive

• 3rd Generation: Batch, Interactive, Near-Real-Time

• 4th Generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25
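Tez's central idea - one DAG of tasks instead of a chain of separate MapReduce jobs - can be shown with a toy scheduler. Task names are invented for illustration; there is no YARN, container, or data movement here:

```python
def run_dag(tasks, deps):
    """Execute a directed-acyclic-graph of tasks in dependency order.
    Toy model of the DAG idea behind Tez (and Spark's stage scheduler)."""
    done, order = set(), []

    def visit(name, seen=()):
        if name in done:
            return
        if name in seen:
            raise ValueError("cycle at " + name)
        for dep in deps.get(name, []):      # run prerequisites first
            visit(dep, seen + (name,))
        order.append(name)
        done.add(name)
        tasks[name]()                       # the task's actual work

    for name in tasks:
        visit(name)
    return order

log = []
tasks = {n: (lambda n=n: log.append(n))
         for n in ["write", "join", "filter", "load"]}
deps = {"filter": ["load"], "join": ["load", "filter"], "write": ["join"]}

print(run_dag(tasks, deps))  # ['load', 'filter', 'join', 'write']
```

In a chained-MapReduce world, each of these edges would be a full job writing its output to HDFS; a DAG engine keeps them in one plan.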

1 Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26
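The cache-versus-recompute behavior behind RDDs can be modeled in miniature. This sketch covers only the laziness and caching idea, not Spark's partitioning, lineage-based fault tolerance, or actual API:

```python
class LazyDataset:
    """Toy RDD-like object: transformations are recorded lazily and
    recomputed on every action, unless .cache() pins the result in memory."""
    def __init__(self, compute):
        self._compute = compute     # thunk producing the data
        self._cached = None
        self.recomputes = 0

    def map(self, f):
        # Transformation: record the work, do nothing yet.
        return LazyDataset(lambda: [f(x) for x in self.collect()])

    def cache(self):
        # Materialize once and keep the result in memory.
        self._cached = self._compute()
        return self

    def collect(self):
        # Action: serve from cache if present, else recompute.
        if self._cached is not None:
            return self._cached
        self.recomputes += 1
        return self._compute()

base = LazyDataset(lambda: list(range(5)))
squares = base.map(lambda x: x * x)
squares.collect(); squares.collect()
print(squares.recomputes)   # 2: recomputed on each action

cached = base.map(lambda x: x * x).cache()
cached.collect(); cached.collect()
print(cached.recomputes)    # 0: both actions served from memory
```

Avoiding that recomputation (and the disk round trips MapReduce would make between jobs) is what makes iterative workloads fast on Spark.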

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

• Batch and streaming in the same system

• Beyond DAGs (cyclic operator graphs)

• Powerful, expressive APIs

• Inside-the-system iterations

• Full Hadoop compatibility

• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce               Tez                    Spark

License              Open Source             Open Source            Open Source
                     Apache 2.0,             Apache 2.0,            Apache 2.0,
                     version 2.x             version 0.x            version 1.x
Processing model     On-disk (disk-based     On-disk; Batch,        In-memory and on-disk;
                     parallelization);       Interactive            Batch, Interactive,
                     Batch                                          Streaming (near real-time)
Language written in  Java                    Java                   Scala
API                  [Java, Python, Scala]   Java [ISV/Engine/      [Scala, Java, Python]
                     User-facing             Tool builder]          User-facing
Libraries            None; separate tools    None                   [Spark Core, Spark
                                                                    Streaming, Spark SQL,
                                                                    MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                Tez                     Spark

Installation      Bound to Hadoop          Bound to Hadoop         Isn't bound to Hadoop
Ease of use       Difficult to program,    Difficult to program;   Easy to program, no
                  needs abstractions;      no interactive mode     need of abstractions;
                  no interactive mode      except Hive, Pig        interactive mode
                  except Hive, Pig
Compatibility     To data types and data   To data types and       To data types and
                  sources is the same      data sources is the     data sources is the
                                           same                    same
YARN integration  YARN application         Ground-up YARN          Spark is moving
                                           application             towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce             Tez                   Spark

Deployment   YARN                  YARN                  [Standalone, YARN,
                                                         SIMR, Mesos, …]
Performance  -                     -                     Good performance when
                                                         data fits into memory;
                                                         performance degradation
                                                         otherwise
Security     More features and     More features and     Still in its infancy
             projects              projects              (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
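Point 1 above in miniature: the same legacy mapper and reducer functions, unchanged, driven by a Spark-style flatMap/reduceByKey chain simulated over plain lists. The driver functions are illustrative stand-ins, not Spark's API:

```python
from collections import OrderedDict

# The "legacy" MapReduce functions, reused as-is:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

# Spark-style driver, simulated over plain lists:
def flat_map(f, data):
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    acc = OrderedDict()
    for k, v in pairs:
        acc[k] = v if k not in acc else f(acc[k], v)
    return dict(acc)

lines = ["spark or hadoop", "spark and hadoop"]
print(reduce_by_key(reducer, flat_map(mapper, lines)))
# {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The migration cost is in the driver plumbing, not in the domain logic: the mapper and reducer bodies did not change.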

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
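Co-occurrence based recommendation, the technique the Mahout-on-Spark talks above cover, reduces to counting how often pairs of items appear together in the same user history. A minimal plain-Python sketch of the idea (illustrative only, not Mahout's actual DSL; Mahout/Spark distribute this counting over a cluster):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(histories):
    """Count how often each pair of items appears in the same user history."""
    counts = defaultdict(int)
    for items in histories:
        # canonical ordering so (a, b) and (b, a) land in the same bucket
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return dict(counts)

def recommend(item, counts, top_n=3):
    """Rank the items that most often co-occur with `item`."""
    scores = {}
    for (a, b), n in counts.items():
        if a == item:
            scores[b] = scores.get(b, 0) + n
        elif b == item:
            scores[a] = scores.get(a, 0) + n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

histories = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
counts = cooccurrence(histories)
```

Here `recommend("milk", counts)` ranks "bread" first, since it co-occurs with "milk" in two histories.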

41

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration

[Slide diagram: Hadoop ecosystem services mapped to open source tools: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cached data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory: http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
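PageRank, the algorithm behind the Neo4j + Spark posts above, is easy to state: a node's rank is a damped sum of the ranks of the nodes linking to it, each divided by its out-degree. A toy pure-Python version of the iteration (illustrative only; GraphX and Spark distribute exactly this loop over partitioned edge lists):

```python
def pagerank(links, damping=0.85, iters=20):
    """links: dict mapping node -> list of outgoing neighbors (no dangling nodes)."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for src, outs in links.items():
            if outs:
                share = rank[src] / len(outs)  # split rank over out-links
                for dst in outs:
                    contrib[dst] += share
        # damping mixes in a uniform "random jump" probability
        rank = {n: (1 - damping) / len(nodes) + damping * c
                for n, c in contrib.items()}
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

With this tiny graph, "c" ends up ranked above "b" because it receives links from both "a" and "b", and the ranks still sum to 1.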

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones. See the Spark JIRA (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
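Schema inference over JSON, the feature described above, amounts to walking the records and unioning the field-to-type mappings. A simplified plain-Python sketch of the idea (a toy model of what Spark SQL does, not its actual implementation, which also merges nested and conflicting types):

```python
import json

def infer_schema(json_lines):
    """Union field names and Python type names across all JSON records."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # sort type sets for deterministic output
    return {field: sorted(types) for field, types in schema.items()}

lines = ['{"name": "a", "age": 3}', '{"name": "b", "city": "NYC"}']
schema = infer_schema(lines)
```

Note how the schema is the union of all fields seen ("name", "age", "city"), which is why no DDL is needed up front.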

56

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
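The benefit of a columnar format like Parquet is that a query touching one column reads only that column from disk. The row-to-column pivot at the heart of it can be sketched in a few lines (a toy model of the layout only; real Parquet adds encoding, compression and nested schemas). Rows are assumed to share the same fields:

```python
def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
cols = to_columnar(rows)
```

A scan of just the "id" column now touches `cols["id"]` and nothing else, which is why columnar formats pair so well with analytical SQL engines like Spark SQL.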

57

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

[Slide diagram: Hadoop ecosystem | Spark ecosystem]

4. Complementarity: Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
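Spark's storage agnosticism works because paths carry their scheme (hdfs://, s3n://, file://, ...) and the Hadoop filesystem API dispatches on it. The dispatch idea in miniature (the scheme names are real 1.x-era ones; the reader names here are hypothetical placeholders, not Spark classes):

```python
from urllib.parse import urlparse

READERS = {            # hypothetical handlers keyed by URI scheme
    "hdfs": "HdfsReader",
    "s3n": "S3Reader",
    "file": "LocalReader",
    "": "LocalReader",  # bare paths default to the local file system
}

def pick_reader(path):
    """Return the handler name for a path, based only on its URI scheme."""
    scheme = urlparse(path).scheme
    if scheme not in READERS:
        raise ValueError("unsupported scheme: %s" % scheme)
    return READERS[scheme]
```

Swapping HDFS for S3, Tachyon or Swift is then just a matter of registering another scheme, which is why the alternatives listed above need no changes to application code.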

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
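Each deployment mode above corresponds to a master URL passed to Spark at startup (for example via spark-submit's --master flag). The common Spark 1.x-era forms can be captured in a small helper; this is a sketch for illustration, so consult the Spark documentation for the authoritative list:

```python
def master_url(mode, host=None, port=None):
    """Build a Spark master URL for the common 1.x deployment modes."""
    if mode == "local":
        return "local[*]"                         # use all local cores
    if mode == "standalone":
        return "spark://%s:%d" % (host, port or 7077)
    if mode == "mesos":
        return "mesos://%s:%d" % (host, port or 5050)
    if mode == "yarn":
        return "yarn-client"  # resolved via Hadoop config, not host:port
    raise ValueError("unknown mode: %s" % mode)
```

The point of the uniform URL scheme is that the same application jar runs unchanged against a laptop, a standalone cluster, Mesos or YARN.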

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

              Hadoop Ecosystem    Spark Ecosystem
Components:
              HDFS                Tachyon
              YARN                Mesos
Tools:
              Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria            YARN                             Mesos
Resource sharing    Yes                              Yes
Written in          Java                             C++
Scheduling          Memory only                      CPU and Memory
Running tasks       Unix processes                   Linux Container groups
Requests            Specific requests and            More generic, but more coding
                    locality preference              for writing frameworks
Maturity            Less mature                      Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
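The conciseness claim is concrete: the canonical word count is just a couple of chained functional transformations (flatMap, then a per-key count). The same shape in plain Python, as a sketch of the style; Spark's RDD API applies these operators to partitioned, distributed data:

```python
from collections import Counter

def word_count(lines):
    """flatMap each line into words, then count per word
    (the map + reduceByKey step in Spark's word count)."""
    return Counter(word for line in lines for word in line.split())

counts = word_count(["to be or", "not to be"])
```

The Scala and Python Spark versions read almost identically, which is the point of the lambda-friendly native API.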

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                       Storm                     Spark Streaming
Processing model               Record at a time          Mini batches
Latency                        Sub-second                Few seconds
Fault tolerance                At least once             Exactly once
(every record processed)       (may be duplicates)
Batch framework integration    Not available             Core Spark API
Supported languages            Any programming           Scala, Java, Python
                               language
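The "mini batches" row is the crux of the comparison: Spark Streaming chops an unbounded stream into small, time-indexed batches and runs an ordinary batch job on each. The batching step itself is simple to model in plain Python (a toy sketch; timestamps are in seconds and `interval` is the batch duration, like Spark Streaming's batch interval):

```python
def micro_batches(events, interval):
    """Group (timestamp, payload) events into fixed-width time buckets,
    the way Spark Streaming discretizes a stream into per-interval RDDs."""
    batches = {}
    for ts, payload in events:
        bucket = int(ts // interval)
        batches.setdefault(bucket, []).append(payload)
    return [batches[b] for b in sorted(batches)]

events = [(0.1, "a"), (0.9, "b"), (1.5, "c"), (3.2, "d")]
out = micro_batches(events, interval=1.0)
```

This batching is also why the latency row reads "few seconds" for Spark Streaming: no result for a batch can be emitted before its interval closes, whereas Storm processes each record as it arrives.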

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2. Surveys

• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015: http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

7

3. Vendors

8

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: http://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• A uniform API for diverse workloads, over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

9

3. Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: http://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

11

4 Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

12

4 Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014

http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1 News: Big Data is no longer a Hadoop monopoly
2 Surveys: Listen to what Spark developers are saying
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized
4 Analysts: Thorough understanding of the market dynamics

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Data

• Big Data is still one of the most inflated buzzwords of recent years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list — stay tuned!

19

5 Key Takeaways

1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink.

23

• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics

24

1 Evolution

• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

25

1 Evolution

• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark

26
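The RDD model described above — transformations are recorded lazily and only executed when an action is called, with optional in-memory caching — can be sketched in a few lines of plain Python. This `MiniRDD` class is an illustrative toy, not the actual Spark API:

```python
from functools import reduce

class MiniRDD:
    """A toy, single-machine stand-in for a Spark RDD: transformations
    are recorded lazily and only run when an action is called."""

    def __init__(self, source, ops=None):
        self._source = source     # base data
        self._ops = ops or []     # pending (lazy) transformations
        self._cache = None        # materialized results, if cached

    def map(self, f):             # transformation: returns a new MiniRDD
        return MiniRDD(self._source, self._ops + [("map", f)])

    def filter(self, p):          # transformation: returns a new MiniRDD
        return MiniRDD(self._source, self._ops + [("filter", p)])

    def cache(self):
        # Eager here for simplicity; Spark materializes on the first action
        self._cache = self._compute()
        return self

    def _compute(self):
        data = list(self._source)
        for kind, f in self._ops:
            data = [f(x) for x in data] if kind == "map" else [x for x in data if f(x)]
        return data

    def collect(self):            # action: triggers the computation
        return self._cache if self._cache is not None else self._compute()

    def reduce(self, f):          # action
        return reduce(f, self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                    # [0, 4, 16, 36, 64]
print(rdd.reduce(lambda a, b: a + b))   # 120
```

Calling `map` and `filter` costs nothing; only `collect` or `reduce` walks the data — the same separation of transformations and actions that lets Spark plan and cache work across a cluster.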

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory and On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility: to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala

2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark, http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
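The "reuse your mapper and reducer" point can be made concrete with word count: the classic mapper/reducer pair is unchanged, and only the surrounding plumbing differs between the two engines. This is a plain-Python sketch of the MapReduce execution model, not actual Spark or Hadoop code:

```python
from collections import defaultdict
from itertools import chain

# Classic MapReduce-style functions, reusable as-is
def mapper(line):              # emits (word, 1) pairs for one input record
    return [(w, 1) for w in line.split()]

def reducer(word, counts):     # sums the counts for one key
    return (word, sum(counts))

def run_word_count(lines):
    # "Map" phase: apply the mapper to every record
    pairs = chain.from_iterable(mapper(line) for line in lines)
    # "Shuffle" phase: group emitted values by key
    groups = defaultdict(list)
    for word, n in pairs:
        groups[word].append(n)
    # "Reduce" phase: apply the reducer once per key
    return dict(reducer(w, ns) for w, ns in groups.items())

lines = ["spark and hadoop", "spark or hadoop"]
print(run_word_count(lines))
# {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

In Spark the same `mapper` would be passed to `flatMap`, with the shuffle and reduce collapsed into a single `reduceByKey` call — which is why migrating a well-factored MapReduce job often needs little new code.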

2 Transition

3 The following tools originally based on Hadoop

MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading on Spark (Expected in Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout on Spark (Expected in Mahout 1.0)

• Mahout News, 25 April 2014 — Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 10 )

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov

April 2014

httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with

Mahout Scala and Spark Published on May 30 2014

httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-

with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)-

MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Spark integrates with open source tools across the Hadoop stack, layer by layer (Service: Open Source Tool):

• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

bull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via

newAPIHadoopRDD Example HBaseTestscala from

Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach

esparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available

for reading from and writing to HBase without the need

of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with

Spark Status Still in experimentation and no timetable for

possible support httpblogclouderacomblog201412new-in-cloudera-

labs-sparkonhbase

45

3 Integration

bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark

RDDs to Cassandra tables and execute arbitrary CQL

queries in your Spark applications Supports also

integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration

is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration

bull Benchmark of Spark amp Cassandra Integration

using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume

data from Cassandra to spark and store Resilient

Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new

avenues

bull Kindling An Introduction to Spark with Cassandra

(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-

spark-with-cassandra

47

3 Integration

• MongoDB is not directly supported by Spark, although it can be used from Spark via the official Mongo-Hadoop connector
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving, and some open issues are critical ones (see the open SPARK JIRA issues mentioning YARN)
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

52
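The pattern above — import relational data, query it with SQL, write results back out — can be sketched locally with Python's built-in `sqlite3` standing in for a Hive table. This is illustrative only; with Spark 1.x you would do the equivalent through `HiveContext` and its `sql()` method:

```python
import sqlite3

# An in-memory table standing in for a Hive table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")],
)

# Run a SQL query over the imported data, as Spark SQL does over Hive tables
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
print(rows)  # [('ERROR', 2), ('INFO', 1)]
```

The value of the integration is exactly this: analysts keep writing plain SQL, while the engine underneath (MapReduce, Tez, or Spark) changes without touching the queries.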

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming — Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
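Spark Streaming's model — a continuous stream cut into small batches, each processed with ordinary batch operations — can be sketched without Kafka or Spark at all. Here fixed-size batches stand in for Spark's time-based batch interval (an assumption made purely to keep the sketch deterministic):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield fixed-size batches from an (unbounded) event iterator.
    Spark Streaming batches by time interval; we batch by count for simplicity."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = iter(["click", "view", "click", "click", "view"])
counts = []
for batch in micro_batches(events, 2):
    # Per-batch "transformation": count clicks in this micro-batch
    counts.append(sum(1 for e in batch if e == "click"))
print(counts)  # [1, 2, 0]
```

Each micro-batch is a small, complete dataset, which is why the same RDD operations used for batch jobs apply unchanged inside a streaming job.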

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL — just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
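Schema inference over JSON records, as Spark SQL does it, amounts to scanning records and unifying the observed field types. A minimal sketch using only the standard library (the widen-to-string conflict rule is a simplification of Spark SQL's actual type-widening logic):

```python
import json

def infer_schema(json_lines):
    """Infer {field: type-name} from newline-delimited JSON records.
    Fields seen with conflicting types fall back to 'str', roughly
    mirroring how Spark SQL widens incompatible types."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            t = type(value).__name__
            if key in schema and schema[key] != t:
                schema[key] = "str"   # conflict across records: widen to string
            else:
                schema.setdefault(key, t)
    return schema

lines = ['{"name": "a", "age": 34}', '{"name": "b", "age": 29, "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that `city` appears in only one record yet still lands in the schema — the inferred schema is the union of fields over all records, which is what makes "no DDL" querying possible.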

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
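What "columnar" buys you can be shown in a few lines: rows are pivoted into per-column arrays, so a query touching one column reads only that array. (Parquet adds encoding, compression and on-disk layout on top of this basic idea; the sketch below shows only the pivot.)

```python
def to_columnar(rows):
    """Pivot a list of uniform dicts (row layout) into a dict of lists (column layout)."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.0}, {"id": 3, "price": 7.25}]
cols = to_columnar(rows)
print(cols["price"])       # only this column is scanned: [9.5, 3.0, 7.25]
print(sum(cols["price"]))  # 19.75
```

An aggregate like `SELECT SUM(price)` never touches the `id` column at all, which is why columnar formats shine for analytical scans over wide tables.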

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
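The "Data << RAM" advantage — cache parsed data once, then run many passes over it — can be made concrete by counting parse calls with and without a cache. This is a sketch of the idea, not Spark's actual persistence machinery:

```python
parse_calls = 0

def parse(record):
    """Stand-in for an expensive parse/deserialize step; counts invocations."""
    global parse_calls
    parse_calls += 1
    return int(record)

raw = ["1", "2", "3"]

# Without caching: every pass re-parses the raw records (MapReduce-style)
total = sum(parse(r) for r in raw)
maximum = max(parse(r) for r in raw)
print(parse_calls)  # 6: two passes, each parsed all 3 records

# With caching: parse once, keep results in memory, reuse across passes
parse_calls = 0
cached = [parse(r) for r in raw]   # analogous to rdd.cache() before repeated actions
total, maximum = sum(cached), max(cached)
print(parse_calls)  # 3: later passes reuse the in-memory data
```

With two passes the cache halves the parsing work; iterative algorithms that make dozens of passes (machine learning, graph analytics) multiply the savings accordingly — and, as the slide notes, the advantage evaporates once the working set no longer fits in memory.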

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recordit.blog.com (original URL garbled in source)
• Lustre File System — Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
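Whichever option you pick, the choice surfaces in the application mostly as the "master" URL handed to Spark at startup. A simplified pure-Python sketch (not Spark itself) classifying master URLs of the kind Spark 1.x accepts; the parsing rules are deliberately reduced for illustration:

```python
def deployment_mode(master: str) -> str:
    """Classify a Spark-style master URL into a deployment mode (simplified sketch)."""
    if master.startswith("local"):                 # "local", "local[4]", "local[*]"
        return "local"
    if master.startswith("spark://"):              # standalone cluster manager
        return "standalone"
    if master.startswith("mesos://"):              # Apache Mesos
        return "mesos"
    if master in ("yarn-client", "yarn-cluster"):  # YARN, Spark 1.x style
        return "yarn"
    return "unknown"

print(deployment_mode("local[*]"))           # local
print(deployment_mode("spark://host:7077"))  # standalone
print(deployment_mode("yarn-cluster"))       # yarn
```

The application code itself is unchanged across all of these modes; only the master URL (and cluster-side configuration) differs.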

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps, and provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and Memory
Running tasks     Unix processes              Linux Container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
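The flavor of that native API — short chains of functional transformations — can be sketched in plain Python. The MiniRDD class below is a toy, single-machine stand-in invented for illustration, not Spark's actual RDD:

```python
from functools import reduce

class MiniRDD:
    """A toy, single-machine stand-in for Spark's chained RDD-style API."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        # Transformation: apply f to every element, yielding a new dataset.
        return MiniRDD(f(x) for x in self.data)

    def filter(self, p):
        # Transformation: keep only elements satisfying predicate p.
        return MiniRDD(x for x in self.data if p(x))

    def reduce(self, f):
        # Action: fold the dataset down to a single value.
        return reduce(f, self.data)

# A word-count-flavored pipeline, written the way a Spark job reads:
lines = MiniRDD(["spark or hadoop", "spark and hadoop"])
total_words = (lines
               .map(lambda line: len(line.split()))
               .reduce(lambda a, b: a + b))
print(total_words)  # 6
```

In real Spark the same chain runs distributed and lazily; the point here is only how compact the functional style is compared with writing mapper and reducer classes by hand.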

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
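That "mix SQL with imperative code" pattern can be illustrated without Spark at all — here with Python's built-in sqlite3 standing in for the SQL engine (the table name and data are made up for the example):

```python
import sqlite3

# In-memory table standing in for a schema-providing source (e.g. Parquet/Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# ...imperative step: post-process the result set in ordinary code.
top_users = [user for (user, total) in rows if total > 4]
print(rows)       # [('ann', 5), ('bob', 7)]
print(top_users)  # ['ann', 'bob']
```

In Spark SQL the query would run against a distributed dataset and the post-processing against an RDD, but the back-and-forth between a SQL step and an imperative step is the same idea.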

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                    Spark Streaming
Processing model             Record at a time         Mini batches
Latency                      Sub-second               Few seconds
Fault tolerance – every      At least once (may       Exactly once
record processed             be duplicates)
Batch framework integration  Not available            Core Spark API
Supported languages          Any programming          Scala, Java, Python
                             language
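The table's "record at a time" vs "mini batches" distinction can be sketched in plain Python. The helper below illustrates the micro-batch idea only; it is not Spark Streaming's API:

```python
def micro_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch   # process a whole batch at once (Spark Streaming style)
            batch = []
    if batch:
        yield batch       # flush the final, possibly partial, batch

records = ["e1", "e2", "e3", "e4", "e5"]
print(list(micro_batches(records, 2)))
# [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```

Batching is what buys Spark Streaming throughput and exactly-once semantics via the core batch engine, at the cost of a few seconds of latency; a record-at-a-time system like Storm makes the opposite trade.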

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud or embedded in non-Hadoop distributions are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 4: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

I Motivation

1 News

2 Surveys

3 Vendors

4 Analysts

5 Key Takeaways

4

1 Newsbull Is it Spark vs OR and Hadoop

bull Apache Spark Hadoop friend or foe

bull Apache Spark killer or savior of Apache Hadoop

bull Apache Sparks Marriage To Hadoop Will Be Bigger Than Kim And Kanye

bull Adios Hadoop Hola Spark

bull Apache Spark Moving on from Hadoop

bull Apache Spark Continues to Spread Beyond Hadoop

bull Escape From Hadoop

bull Spark promises to end up Hadoop but in a good way

5

2 Surveysbull Hadoops historic focus on batch processing of data

was well supported by MapReduce but there is an

appetite for more flexible developer tools to support

the larger market of mid-size datasets and use cases

that call for real-time processingrdquo 2015 Apache Spark

Survey by Typesafe January 27 2015

httpwwwmarketwiredcompress-releasesurvey-indicates-apache-spark-

gaining-developer-adoption-as-big-datas-projects-1986162htm

bull Apache Spark Preparing for the Next Wave of

Reactive Big Data January 27 2015 by Typesafe

httptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-

big-data

6

Apache Spark Survey 2015 by

Typesafe - Quick Snapshot

7

3 Vendors

8

bull Spark and Hadoop Working Together January 21

2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-

hadoophtml

bull Uniform API for diverse workloads over diverse

storage systems and runtimes

Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark

Summit 2014) November 2014 Matei

Zahariahttpwwwslidesharenetdatabricksspark-summit2014

bull The goal of Apache Spark is to have one engine for all

data sources workloads and environmentsrdquo

Source Slide 15 of lsquoNew Directions for Apache Spark in 2015

February 20 2015 Strata + Hadoop Summit Matei Zaharia

httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015

3 Vendorsbull ldquoSpark is already an excellent piece of software and is

advancing very quickly No vendor mdash no new project mdashis likely to catch up Chasing Spark would be a wasteof time and would delay availability of real-time analyticand processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark

bull ldquoApache Spark is an open source parallel dataprocessing framework that complements ApacheHadoop to make it easy to develop fast unified Big Dataapplications combining batch streaming and interactiveanalytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

9

3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-

scale data processing Spark supports rapid application

development for big data and allows for code reuse

across batch interactive and streaming applications

Spark also provides advanced execution graphs with in-

memory pipelining to speed up end-to-end application

performancerdquo httpswwwmaprcomproductsapache-spark

bull MapR Adds Complete Apache Spark Stack to its

Distribution for Hadoophttpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-

spark-stack-its-distribution-hadoop

10

3 Vendorsbull ldquoApache Spark provides an elegant attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast in-memory data processingrdquo httphortonworkscomhadoopspark

bull Hortonworks A shared vision for Apache Spark on Hadoop October 212014httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml

bull ldquoAt Hortonworks we love Spark and want to help our customers leverage all its benefitsrdquo October 30th 2014httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

11

4 Analystsbull Is Apache Spark replacing Hadoop or complementing

existing Hadoop practice

bull Both are already happening

bull With uncertainty about ldquowhat is Hadooprdquo there is no

reason to think solution stacks built on Spark not

positioned as Hadoop will not continue to proliferate

as the technology matures

bull At the same time Hadoop distributions are all

embracing Spark and including it in their offerings

Source Hadoop Questions from Recent Webinar Span Spectrum

February 25 2015httpblogsgartnercommerv-adrian20150225hadoop-

questions-from-recent-webinar-span-spectrum

12

4 Analysts bull ldquoAfter hearing the confusion between Spark and

Hadoop one too many times I was inspired to write a report The Hadoop Ecosystem Overview Q4 2104

bull For those that have day jobs that donrsquot include constantly tracking Hadoop evolution I dove in and worked with Hadoop vendors and trusted consultants to create a framework

bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require itrdquo Source Elephants Pigs Rhinos and Giraphs Oh My ndash Its Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014

httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1 News Big Data is no longer a Hadoop

monopoly

2 Surveys Listen to what Spark developers are

saying

3 Vendors ltHadoop Vendorgt-tinted goggles

FUD is still being lsquoofferedrsquo by some Hadoop

vendors Claims need to be contextualized

4 Analysts Thorough understanding of the

market dynamics

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Databull Big Data is still one of the most inflated buzzword of

the last years

bull Big Data is a broad term for data sets so large or

complex that traditional data processing tools are

inadequate httpenwikipediaorgwikiBig_data

bull Hadoop is becoming a traditional tool Above

definition is inadequate

bull ldquoBig Data refers to datasets and flows large enough

that has outpaced our capability to store process

analyze and understandrdquo Amir H Payberah

Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoopbull Apache Hadoop as an example of a Typical Big Data

Stack

bull Hadoop ecosystem = Hadoop Stack + many other tools

(either open source and free or commercial ones)

bull Big Data Ecosystem Dataset httpbigdataandreamostosiname

Incomplete but a useful list of Big Data related projects

packed into a JSON dataset

bull Hadoops Impact on Data Managements Future - Amr

Awadallah (Strata + Hadoop 2015) February 19 2015 Watch

video at 236 on lsquoHadoop Isnrsquot Just Hadoop Anymorersquo for a picture

representing the evolution of Apache Hadoop

httpswwwyoutubecomwatchv=1KvTZZAkHy0

18

4 Apache Sparkbull Apache Spark as an example of a Typical Big Data Stack

bull Apache Spark provides you Big Data computing and more

bull BYOS Bring Your Own Storage

bull BYOC Bring Your Own Cluster

bull Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark

bull Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming

bull Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql

bull MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib

bull GraphX httpsparkbigdatacomcomponenttagstag6-graphx

bull Spark ecosystem is emerging fast with roots from BDAS Berkley Data Analytics Stack and new tools from both the open source community and commercial one Irsquom compiling a list Stay tuned

19

5 Key Takeaways

1 Big Data Still one of the most inflated

buzzword

2 Typical Big Data Stack Big Data Stacks look

similar on paper Arenrsquot they

3 Apache Hadoop Hadoop is no longer

lsquosynonymousrsquo of Big Data

4 Apache Spark Emergence of the Apache

Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

bull MapReduce in Java is like assembly code of Big

Data httpwikiapacheorghadoopWordCount

bull Pig httppigapacheorg

bull Hive httphiveapacheorg

bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi

bull Cascading httpwwwcascadingorg

bull Scalding A Scala API for Cascading httptwittercomscalding

bull Crunch httpcrunchapacheorg

bull Scrunch httpcrunchapacheorgscrunchhtml

22

1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007

MapReduce v1 was the only choice as a compute model

(Execution Engine) on Hadoop Now we have in addition

to MapReduce v2 Tez Spark and Flink

23

bull Batch bull Batch

bull Interactive

bull Batch

bull Interactive

bull Near-Real

time

bull Batch

bull Interactive

bull Real-Time

bull Iterative

bull 1st

Generation

bull 2nd

Generation

bull 3rd

Generation

bull 4th

Generation

1 Evolution

bull This is how Hadoop MapReduce is branding itself ldquoA YARN-based system for parallel processing of large data sets httphadoopapacheorg

bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip

bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job

bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics

24

1 Evolution

bull Tez Hindi for ldquospeedrdquo

bull This is how Apache Tez is branding itself ldquoTheApache Tez project is aimed at building anapplication framework which allows for a complexdirected-acyclic-graph of tasks for processingdata It is currently built atop YARNrdquo

Source httptezapacheorg

bull Apachetrade Tez is an extensible framework for

building high performance batch and

interactive data processing applicationscoordinated by YARN in Apache Hadoop

25

1 Evolution

bull lsquoSparkrsquo for lightning fast speed

bull This is how Apache Spark is branding itselfldquoApache Sparktrade is a fast and general engine forlarge-scale data processingrdquo httpssparkapacheorg

bull Apache Spark is a general purpose clustercomputing framework its execution modelsupports wide variety of use cases batchinteractive near-real time

bull The rapid in-memory processing of resilientdistributed datasets (RDDs) is the ldquocorecapabilityrdquo of Apache Spark

26

1 Evolution Apache Flink

bull Flink German for ldquonimble swift speedyrdquo

bull This is how Apache Flink is branding itself ldquoFast andreliable large-scale data processing enginerdquo

bull Apache Flink httpflinkapacheorg offers

bull Batch and Streaming in the same system

bull Beyond DAGs (Cyclic operator graphs)

bull Powerful expressive APIs

bull Inside-the-system iterations

bull Full Hadoop compatibility

bull Automatic language independent optimizer

bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria

License Open Source

Apache 20 version

2x

Open Source

Apache 20

version 0x

Open Source

Apache 20 version

1x

Processing

Model

On-Disk (Disk-

based

parallelization)

Batch

On-Disk Batch

Interactive

In-Memory On-Disk

Batch Interactive

Streaming (Near Real-

Time)

Language written

in

Java Java Scala

API [Java Python

Scala] User-Facing

Java[

ISVEngineTool

builder]

[Scala Java Python]

User-Facing

Libraries None separate tools None [Spark Core Spark

Streaming Spark SQL

MLlib GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria

Installation Bound to Hadoop Bound to Hadoop Isnrsquot bound to

Hadoop

Ease of Use Difficult to program

needs abstractions

No Interactive mode

except Hive Pig

Difficult to program

No Interactive

mode except Hive

Pig

Easy to program

no need of

abstractions

Interactive mode

Compatibilit

y

to data types and data

sources is same

to data types and

data sources is

same

to data types and

data sources is

same

YARN

integration

YARN application Ground up YARN

application

Spark is moving

towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria

Deployment YARN YARN [Standalone YARN

SIMR Mesos hellip]

Performance - Good performance

when data fits into

memory

- performance

degradation otherwise

Security More features and

projects

More

features and

projects

Still in its infancy

30

Partial support

IV Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

bull Existing Hadoop MapReduce projects can

migrate to Spark and leverage Spark Core as

execution engine

1 You can often reuse your mapper and

reducer functions and just call them in

Spark from Java or Scala

2 You can translate your code from

MapReduce to Apache Spark How-to

Translate from MapReduce to Apache Sparkhttpblogclouderacomblog201409how-to-translate-from-mapreduce-to-

apache-spark

32

2 Transition

3 The following tools originally based on Hadoop

MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

33

Pig on Spark (Spork)

bull Run Pig with ldquondashx sparkrdquo option for an easy migration

without development effort

bull Speed up your existing pig scripts on Spark ( Query

Logical Plan Physical Pan)

bull Leverage new Spark specific operators in Pig such as

Cache

bull Still leverage many existing Pig UDF libraries

bull Pig on Spark Umbrella Jira (Status Passed end-to-end test

cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059

bull Fix outstanding issues and address additional Spark functionality

through the community

bull lsquoPig on Sparkrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag19

34

Hive on Spark (Currently in Beta

Expected in Hive 110)

bull New alternative to using MapReduce or Tez

hivegt set hiveexecutionengine=spark

bull Help existing Hive applications running on

MapReduce or Tez easily migrate to Spark without

development effort

bull Exposes Spark users to a viable feature-rich de facto

standard SQL tool on Hadoop

bull Performance benefits especially for Hive queries

involving multiple reducer stages

bull Hive on Spark Umbrella Jira (Status Open) Q1 2015httpsissuesapacheorgjirabrowseHIVE-7292

35

Hive on Spark (Currently in Beta

Expected in Hive 110)

bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-

motivations-and-design-principles

bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Start

ed

bull Hive on Spark February 11 2015 Szehon Ho

Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark

bull Hive on spark is blazing fast or is it Carter Shanklin and

Mostapah Mokhtar (Hortonworks) February 20 2015httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final

bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

36

Sqoop on Spark

(Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially

developed as a tool to transfer data from RDBMS to

Hadoop

bull The next version of Sqoop referred to as Sqoop2

supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under

discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Pro

posal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira

Status Work In Progress) The goal of this ticket is to support a

pluggable way to select the execution engine on which we can run

the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

37

(Expected in 31 release)

bull Cascading httpwwwcascadingorg is an application

development platform for building data applications on

Hadoop

bull Support for Apache Spark is on the roadmap and will be

available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the

transition from CascadingScalding to Spark a little

easier by adding support for Cascading Taps Scalding

Sources and the Scalding Fields API in Spark Sourcehttpscaldingio201410running-scalding-on-apache-spark

38

Apache Crunch

bull The Apache Crunch Java library provides a

framework for writing testing and running

MapReduce pipelines httpscrunchapacheorg

bull Apache Crunch 011 releases with a

SparkPipeline class making it easy to migrate

data processing applications from MapReduce

to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSpark

Pipelinehtml

bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-

xtopicscdh_ig_running_crunch_with_sparkhtml

39

(Expec (Expected in Mahout 10 )

bull Mahout News 25 April 2014 - Goodbye MapReduce

Apache Mahout the original Machine Learning (ML)

library for Hadoop since 2009 is rejecting new

MapReduce algorithm

implementationshttpmahoutapacheorg

bull Integration of Mahout and Spark

bull Reboot with new Mahout Scala DSL for Distributed

Machine Learning on Spark Programs written in this

DSL are automatically optimized and executed in

parallel on Apache Spark

bull Mahout Interactive Shell Interactive REPL shell for

Spark optimized Mahout DSLhttpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

40

(Expected in Mahout 10 )

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov

April 2014

httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with

Mahout Scala and Spark Published on May 30 2014

httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-

with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)-

MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 IntegrationService Open Source Tool

StorageServi

ng Layer

Data Formats

Data

Ingestion

Services

Resource

Management

Search

SQL

43

3 Integration

bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

bull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via

newAPIHadoopRDD Example HBaseTestscala from

Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach

esparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available

for reading from and writing to HBase without the need

of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with

Spark Status Still in experimentation and no timetable for

possible support httpblogclouderacomblog201412new-in-cloudera-

labs-sparkonhbase

45

3 Integration

bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark

RDDs to Cassandra tables and execute arbitrary CQL

queries in your Spark applications Supports also

integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration

is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration

bull Benchmark of Spark amp Cassandra Integration

using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume

data from Cassandra to spark and store Resilient

Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new

avenues

bull Kindling An Introduction to Spark with Cassandra

(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-

spark-with-cassandra

47

3 Integration

bull MongoDB is not directly served by Spark although

it can be used from Spark via an official Mongo-

Hadoop connector

bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-

insights

bull Spark SQL also provides indirect support via its

support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, or embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
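The idea behind that automatic schema inference can be pictured with a small sketch. This is plain Python, not Spark code: scan every record, collect the union of fields, and resolve each field's type, which is roughly what Spark SQL does for a JSON dataset before exposing it as a SchemaRDD.

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of keys, with a crude widening rule."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            t = type(value).__name__
            # On conflicting types across records, widen to string
            if key in schema and schema[key] != t:
                schema[key] = "str"
            else:
                schema[key] = t
    return schema

lines = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "city": "LA"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that fields missing from some records (like `age` and `city` above) still end up in the inferred schema; in Spark SQL they would simply be nullable.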

56

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
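As a rough intuition for why a columnar format like Parquet suits analytics, here is a toy illustration in plain Python (not the Parquet format itself): storing values column by column lets an aggregation touch only the columns it needs, instead of deserializing whole rows.

```python
# The same records in two layouts: a row store and a toy column store.
rows = [
    {"user": "a", "bytes": 120, "country": "US"},
    {"user": "b", "bytes": 300, "country": "FR"},
    {"user": "c", "bytes": 50,  "country": "US"},
]

# Column layout: one list per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A sum over 'bytes' scans exactly one column; 'user' and 'country'
# are never touched, which is where columnar formats save I/O.
total = sum(columns["bytes"])
print(total)  # 470
```

Real columnar formats add encodings and compression per column on top of this layout, but the access pattern is the same.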

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

(figure: Hadoop ecosystem and Spark ecosystem components)

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: references

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
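The Data >> RAM / Data << RAM rule of thumb above can be written down as a trivial decision function. This is an illustrative sketch only; the engine names are just labels, not an actual dispatcher API, and a real choice would weigh many more factors (job shape, iteration, SLAs).

```python
def pick_engine(dataset_gb, cluster_ram_gb):
    """Toy heuristic from the slide: cache-friendly jobs go to Spark,
    larger-than-memory batch jobs to a disk-oriented engine like Tez."""
    if dataset_gb < cluster_ram_gb:
        return "spark"  # data fits in memory, so caching pays off
    return "tez"        # data >> RAM: stream-oriented, mature shuffling wins

print(pick_engine(100, 512))   # spark
print(pick_engine(5000, 512))  # tez
```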

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

Hadoop ecosystem → Spark ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

• Resource sharing: YARN: yes; Mesos: yes
• Written in: YARN: Java; Mesos: C++
• Scheduling: YARN: memory only; Mesos: CPU and memory
• Running tasks: YARN: Unix processes; Mesos: Linux container groups
• Requests: YARN: specific requests and locality preference; Mesos: more generic, but more coding for writing frameworks
• Maturity: YARN: less mature; Mesos: relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
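The conciseness argument is easiest to see with a toy RDD-like wrapper. This is plain Python, not the actual Spark API: a chain of transformations reads almost like the sentence describing it ("take the lines, keep the errors, count them").

```python
class ToyRDD:
    """A minimal stand-in for an RDD, supporting a chained functional style."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def count(self):
        return len(self.data)

lines = ["error: disk full", "ok", "error: timeout", "ok"]
n_errors = ToyRDD(lines).filter(lambda l: l.startswith("error")).count()
print(n_errors)  # 2
```

In real Spark, `map` and `filter` are additionally lazy and distributed across the cluster, but the user-facing chaining style is the same in Scala, Java 8, and Python.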

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

• Processing model: Storm: record at a time; Spark Streaming: mini-batches
• Latency: Storm: sub-second; Spark Streaming: a few seconds
• Fault tolerance (every record processed): Storm: at least once (may be duplicates); Spark Streaming: exactly once
• Batch framework integration: Storm: not available; Spark Streaming: core Spark API
• Supported languages: Storm: any programming language; Spark Streaming: Scala, Java, Python

95
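The mini-batch model in the comparison above can be sketched in a few lines of plain Python (no Spark required), loosely analogous to how Spark Streaming discretizes a stream: incoming timestamped records are grouped into fixed-width batches, and each batch is then processed as a small job.

```python
def micro_batches(records, batch_seconds):
    """Group (timestamp, value) records into fixed-width mini-batches."""
    batches = {}
    for ts, value in records:
        window = int(ts // batch_seconds)  # which batch this record falls in
        batches.setdefault(window, []).append(value)
    # Emit batches in time order; each one would become a small batch job
    return [batches[w] for w in sorted(batches)]

stream = [(0.5, "a"), (1.2, "b"), (2.7, "c"), (3.1, "d")]
print(micro_batches(stream, 2))  # [['a', 'b'], ['c', 'd']]
```

The batch width is exactly the latency/throughput knob from the table: record-at-a-time engines like Storm effectively set it near zero, while mini-batch engines trade a few seconds of latency for batch-engine integration.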

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 5: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

1 Newsbull Is it Spark vs OR and Hadoop

bull Apache Spark Hadoop friend or foe

bull Apache Spark killer or savior of Apache Hadoop

bull Apache Sparks Marriage To Hadoop Will Be Bigger Than Kim And Kanye

bull Adios Hadoop Hola Spark

bull Apache Spark Moving on from Hadoop

bull Apache Spark Continues to Spread Beyond Hadoop

bull Escape From Hadoop

bull Spark promises to end up Hadoop but in a good way

5

2 Surveysbull Hadoops historic focus on batch processing of data

was well supported by MapReduce but there is an

appetite for more flexible developer tools to support

the larger market of mid-size datasets and use cases

that call for real-time processingrdquo 2015 Apache Spark

Survey by Typesafe January 27 2015

httpwwwmarketwiredcompress-releasesurvey-indicates-apache-spark-

gaining-developer-adoption-as-big-datas-projects-1986162htm

bull Apache Spark Preparing for the Next Wave of

Reactive Big Data January 27 2015 by Typesafe

httptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-

big-data

6

Apache Spark Survey 2015 by

Typesafe - Quick Snapshot

7

3 Vendors

8

bull Spark and Hadoop Working Together January 21

2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-

hadoophtml

bull Uniform API for diverse workloads over diverse

storage systems and runtimes

Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark

Summit 2014) November 2014 Matei

Zahariahttpwwwslidesharenetdatabricksspark-summit2014

bull The goal of Apache Spark is to have one engine for all

data sources workloads and environmentsrdquo

Source Slide 15 of lsquoNew Directions for Apache Spark in 2015

February 20 2015 Strata + Hadoop Summit Matei Zaharia

httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015

3 Vendorsbull ldquoSpark is already an excellent piece of software and is

advancing very quickly No vendor mdash no new project mdashis likely to catch up Chasing Spark would be a wasteof time and would delay availability of real-time analyticand processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark

bull ldquoApache Spark is an open source parallel dataprocessing framework that complements ApacheHadoop to make it easy to develop fast unified Big Dataapplications combining batch streaming and interactiveanalytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

9

3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-

scale data processing Spark supports rapid application

development for big data and allows for code reuse

across batch interactive and streaming applications

Spark also provides advanced execution graphs with in-

memory pipelining to speed up end-to-end application

performancerdquo httpswwwmaprcomproductsapache-spark

bull MapR Adds Complete Apache Spark Stack to its

Distribution for Hadoophttpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-

spark-stack-its-distribution-hadoop

10

3 Vendorsbull ldquoApache Spark provides an elegant attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast in-memory data processingrdquo httphortonworkscomhadoopspark

bull Hortonworks A shared vision for Apache Spark on Hadoop October 212014httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml

bull ldquoAt Hortonworks we love Spark and want to help our customers leverage all its benefitsrdquo October 30th 2014httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

11

4 Analystsbull Is Apache Spark replacing Hadoop or complementing

existing Hadoop practice

bull Both are already happening

bull With uncertainty about ldquowhat is Hadooprdquo there is no

reason to think solution stacks built on Spark not

positioned as Hadoop will not continue to proliferate

as the technology matures

bull At the same time Hadoop distributions are all

embracing Spark and including it in their offerings

Source Hadoop Questions from Recent Webinar Span Spectrum

February 25 2015httpblogsgartnercommerv-adrian20150225hadoop-

questions-from-recent-webinar-span-spectrum

12

4 Analysts bull ldquoAfter hearing the confusion between Spark and

Hadoop one too many times I was inspired to write a report The Hadoop Ecosystem Overview Q4 2104

bull For those that have day jobs that donrsquot include constantly tracking Hadoop evolution I dove in and worked with Hadoop vendors and trusted consultants to create a framework

bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require itrdquo Source Elephants Pigs Rhinos and Giraphs Oh My ndash Its Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014

httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1 News Big Data is no longer a Hadoop

monopoly

2 Surveys Listen to what Spark developers are

saying

3 Vendors ltHadoop Vendorgt-tinted goggles

FUD is still being lsquoofferedrsquo by some Hadoop

vendors Claims need to be contextualized

4 Analysts Thorough understanding of the

market dynamics

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Databull Big Data is still one of the most inflated buzzword of

the last years

bull Big Data is a broad term for data sets so large or

complex that traditional data processing tools are

inadequate httpenwikipediaorgwikiBig_data

bull Hadoop is becoming a traditional tool Above

definition is inadequate

bull ldquoBig Data refers to datasets and flows large enough

that has outpaced our capability to store process

analyze and understandrdquo Amir H Payberah

Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoopbull Apache Hadoop as an example of a Typical Big Data

Stack

bull Hadoop ecosystem = Hadoop Stack + many other tools

(either open source and free or commercial ones)

bull Big Data Ecosystem Dataset httpbigdataandreamostosiname

Incomplete but a useful list of Big Data related projects

packed into a JSON dataset

bull Hadoops Impact on Data Managements Future - Amr

Awadallah (Strata + Hadoop 2015) February 19 2015 Watch

video at 236 on lsquoHadoop Isnrsquot Just Hadoop Anymorersquo for a picture

representing the evolution of Apache Hadoop

httpswwwyoutubecomwatchv=1KvTZZAkHy0

18

4 Apache Sparkbull Apache Spark as an example of a Typical Big Data Stack

bull Apache Spark provides you Big Data computing and more

bull BYOS Bring Your Own Storage

bull BYOC Bring Your Own Cluster

bull Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark

bull Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming

bull Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql

bull MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib

bull GraphX httpsparkbigdatacomcomponenttagstag6-graphx

bull Spark ecosystem is emerging fast with roots from BDAS Berkley Data Analytics Stack and new tools from both the open source community and commercial one Irsquom compiling a list Stay tuned

19

5 Key Takeaways

1 Big Data Still one of the most inflated

buzzword

2 Typical Big Data Stack Big Data Stacks look

similar on paper Arenrsquot they

3 Apache Hadoop Hadoop is no longer

lsquosynonymousrsquo of Big Data

4 Apache Spark Emergence of the Apache

Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

bull MapReduce in Java is like assembly code of Big

Data httpwikiapacheorghadoopWordCount

bull Pig httppigapacheorg

bull Hive httphiveapacheorg

bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi

bull Cascading httpwwwcascadingorg

bull Scalding A Scala API for Cascading httptwittercomscalding

bull Crunch httpcrunchapacheorg

bull Scrunch httpcrunchapacheorgscrunchhtml

22

1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007

MapReduce v1 was the only choice as a compute model

(Execution Engine) on Hadoop Now we have in addition

to MapReduce v2 Tez Spark and Flink

23

bull Batch bull Batch

bull Interactive

bull Batch

bull Interactive

bull Near-Real

time

bull Batch

bull Interactive

bull Real-Time

bull Iterative

bull 1st

Generation

bull 2nd

Generation

bull 3rd

Generation

bull 4th

Generation

1 Evolution

bull This is how Hadoop MapReduce is branding itself ldquoA YARN-based system for parallel processing of large data sets httphadoopapacheorg

bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip

bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job

bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics

24

1. Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution

• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria            | MapReduce                                    | Tez                                  | Spark
License             | Open Source, Apache 2.0, version 2.x         | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-disk (disk-based parallelization); batch  | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                         | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing           | Java [ISV/Engine/Tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                         | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                     | Tez                                                       | Spark
Installation     | Bound to Hadoop                                                               | Bound to Hadoop                                           | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                          | Same for data types and data sources                      | Same for data types and data sources
YARN integration | YARN application                                                              | Ground-up YARN application                                | Spark is moving towards YARN (partial support)

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy

30


III. Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
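The first migration path, reusing existing mapper and reducer functions, can be sketched in plain Python. The flat_map and reduce_by_key helpers below only mimic the shape of Spark's flatMap and reduceByKey transformations; this is a conceptual sketch, not the actual Spark API.

```python
from itertools import chain, groupby

# A mapper/reducer pair written in the classic MapReduce shape.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return key, sum(values)

# Spark-style driver: the same two functions plug into chained transformations.
def flat_map(f, data):
    return list(chain.from_iterable(f(x) for x in data))

def reduce_by_key(f, pairs):
    pairs = sorted(pairs)  # stand-in for the shuffle: bring equal keys together
    return [f(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda p: p[0])]

lines = ["spark on yarn", "spark on mesos"]
result = dict(reduce_by_key(reducer, flat_map(mapper, lines)))
print(result)  # {'mesos': 1, 'on': 2, 'spark': 2, 'yarn': 1}
```

The point is that mapper and reducer are untouched; only the driver code around them changes when moving to a Spark-style pipeline.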

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3. Integration

[Diagram: Hadoop-ecosystem services and the open source tools Spark integrates with, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
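What "automatically infer the schema" means can be illustrated with a small stand-in in plain Python: scan a sample of JSON records and derive a field-to-type mapping. This is a conceptual sketch of the idea only, not the Spark SQL implementation, and the record fields are made up.

```python
import json

# Hypothetical sample records; in Spark SQL these would live in JSON files.
sample = [
    '{"user": "ann", "age": 34, "vip": true}',
    '{"user": "bob", "age": 27, "vip": false}',
]

def infer_schema(json_lines):
    """Derive a simple field -> type-name mapping from sample records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

print(infer_schema(sample))  # {'user': 'str', 'age': 'int', 'vip': 'bool'}
```

Spark SQL does this pass over the data for you (handling nested fields and type conflicts), which is why no DDL is needed before querying.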

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

[Diagram: Hadoop ecosystem and Spark ecosystem]

4. Complementarity: Hadoop + Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

Category  | Hadoop Ecosystem | Spark Ecosystem
Component | HDFS             | Tachyon
Component | YARN             | Mesos
Tool      | Pig              | Spark native API
Tool      | Hive             | Spark SQL
Tool      | Mahout           | MLlib
Tool      | Storm            | Spark Streaming
Tool      | Giraph           | GraphX
Tool      | HUE              | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                             Mesos
Resource sharing   Yes                              Yes
Written in         Java                             C++
Scheduling         Memory only                      CPU and Memory
Running tasks      Unix processes                   Linux Container groups
Requests           Specific requests and            More generic, but more coding
                   locality preference              for writing frameworks
Maturity           Less mature                      Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
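As a sketch of the native Scala API's conciseness, here is the classic word count (the HDFS paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///data/input")   // hypothetical path
      .flatMap(_.split("\\s+"))                      // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // sum the counts per word
    counts.saveAsTextFile("hdfs:///data/wordcounts")
    sc.stop()
  }
}
```

The same pipeline is a full Mapper/Reducer/Driver class triple in classic MapReduce Java.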

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
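A minimal sketch of the Hive compatibility described above, using the Spark 1.x HiveContext (the table and column names are hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads table definitions from the existing Hive metastore.
val hiveCtx = new HiveContext(sc)
val topPages = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM web_logs " +
  "GROUP BY page ORDER BY hits DESC LIMIT 10")
topPages.collect().foreach(println)
```

Existing HiveQL queries, Hive UDFs and SerDes work unchanged; only the execution engine differs.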

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                    Spark Streaming
Processing model              Record at a time         Mini-batches
Latency                       Sub-second               Few seconds
Fault tolerance (every        At least once (may       Exactly once
record processed)             be duplicates)
Batch framework integration   Not available            Core Spark API
Supported languages           Any programming          Scala, Java,
                              language                 Python

95
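The mini-batch model above can be sketched with a 2-second batch interval (the host and port are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 2-second mini-batch of text lines becomes one RDD in the DStream.
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()                                            // per-batch word counts
ssc.start()
ssc.awaitTermination()
```

This is why latency is "a few seconds" rather than sub-second: results only appear once per batch interval, but each batch reuses the core Spark batch API.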

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


2. Surveys

• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015: http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data

6

Apache Spark Survey 2015 by Typesafe - Quick Snapshot

7

3. Vendors

8

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

9

3. Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: http://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

11

4. Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

12

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

14

II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

15

1. Big Data

• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2. Typical Big Data Stack

17

3. Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark

• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

19

5. Key Takeaways

1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

20

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

21

1. Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

Generation        Workloads supported
1st generation    Batch
2nd generation    Batch, Interactive
3rd generation    Batch, Interactive, Near-Real Time
4th generation    Batch, Interactive, Real-Time, Iterative

1. Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

24

1. Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution

• 'Spark', for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                  Tez                        Spark
License              Open Source Apache 2.0,    Open Source Apache 2.0,    Open Source Apache 2.0,
                     version 2.x                version 0.x                version 1.x
Processing model     On-Disk (disk-based        On-Disk (Batch,            In-Memory, On-Disk (Batch,
                     parallelization), Batch    Interactive)               Interactive, Streaming
                                                                           (Near Real-Time))
Language written in  Java                       Java                       Scala
API                  [Java, Python, Scala]      Java [ISV/Engine/Tool      [Scala, Java, Python]
                     User-Facing                builder]                   User-Facing
Libraries            None, separate tools       None                       [Spark Core, Spark Streaming,
                                                                           Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria           MapReduce                  Tez                        Spark
Installation       Bound to Hadoop            Bound to Hadoop            Isn't bound to Hadoop
Ease of use        Difficult to program,      Difficult to program;      Easy to program, no need
                   needs abstractions; no     no interactive mode        of abstractions;
                   interactive mode except    except Hive, Pig           interactive mode
                   Hive, Pig
Compatibility      to data types and data     to data types and data     to data types and data
                   sources is same            sources is same            sources is same
YARN integration   YARN application           Ground-up YARN             Spark is moving
                                              application                towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria      MapReduce                Tez                  Spark
Deployment    YARN                     YARN                 [Standalone, YARN, SIMR, Mesos, …]
Performance   -                        Partial support      - Good performance when data fits
                                                            into memory
                                                            - performance degradation otherwise
Security      More features and        More features and    Still in its infancy
              projects                 projects

30

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
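Point 1 above can be sketched like this: the body of an existing Mapper.map() becomes a function passed to flatMap, and the Reducer.reduce() logic becomes the function passed to reduceByKey. The functions and paths here are hypothetical stand-ins for your own MapReduce code:

```scala
// Former Mapper.map() body: emit (word, 1) pairs for each input line.
def mapLogic(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// Former Reducer.reduce() body: sum the values for one key.
def reduceLogic(a: Int, b: Int): Int = a + b

val output = sc.textFile("hdfs:///in")   // hypothetical input
  .flatMap(mapLogic)
  .reduceByKey(reduceLogic)
output.saveAsTextFile("hdfs:///out")     // hypothetical output
```

The per-record logic carries over unchanged; only the job-driver boilerplate disappears.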

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading on Spark (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye, MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout on Spark (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration

(Diagram: services with open source tools from both the Hadoop and Spark ecosystems)
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45
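A condensed sketch of the newAPIHadoopRDD approach used in HBaseTest.scala (the table name is hypothetical):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table
// Each element of the RDD is one HBase row: (row key, Result).
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hbaseRDD.count())
```

No HBase-specific glue is needed in Spark itself; the existing Hadoop InputFormat does the work.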

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46
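A minimal sketch of the connector's read and write paths (the keyspace, table, and column names are hypothetical):

```scala
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD of rows.
val words = sc.cassandraTable("demo_ks", "words")   // hypothetical keyspace/table
words.map(row => (row.getString("word"), row.getInt("count"))).collect()

// Write an RDD back to the same table.
sc.parallelize(Seq(("spark", 100), ("hadoop", 50)))
  .saveToCassandra("demo_ks", "words", SomeColumns("word", "count"))
```

The connector talks to Cassandra natively over CQL, so no Hadoop InputFormat is involved.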

3. Integration

• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving; see the open Spark-on-YARN issues in Jira (project = SPARK AND summary ~ yarn AND status = OPEN, ordered by priority): https://issues.apache.org/jira/issues/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
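A sketch of the receiver-based Kafka integration available at the time (the ZooKeeper quorum, consumer group, and topic name are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
// The Map gives topic -> number of receiver threads.
val stream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "demo-group", Map("events" -> 1))   // hypothetical values
stream.map(_._2)   // keep the message payload, drop the key
  .count()
  .print()         // number of events per 2-second batch
ssc.start()
ssc.awaitTermination()
```

Spark 1.3 later added a direct (receiver-less) Kafka API with exactly-once semantics.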

3. Integration

• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
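The schema-inference workflow above can be sketched as follows, using the Spark 1.x SQLContext (the file path and field names are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
// The schema is inferred automatically from the JSON records: no DDL needed.
val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // hypothetical path
people.printSchema()
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```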

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
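The three bullet points above can be sketched with the Spark 1.x SchemaRDD API (the paths and field names are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
// Import: Parquet files carry their own schema, so no DDL is needed.
val logs = sqlCtx.parquetFile("hdfs:///data/logs.parquet")  // hypothetical path
logs.registerTempTable("logs")
// Query the imported data with SQL.
val errors = sqlCtx.sql("SELECT * FROM logs WHERE level = 'ERROR'")
// Write the result back out as Parquet.
errors.saveAsParquetFile("hdfs:///data/error-logs.parquet")
```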

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
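A sketch of the elasticsearch-hadoop RDD integration (the index/type name and documents are hypothetical):

```scala
import org.elasticsearch.spark._   // adds saveToEs and esRDD to SparkContext/RDDs

// Index a few documents; any RDD whose elements map to documents will do.
val docs = Seq(Map("title" -> "Spark and Hadoop"), Map("title" -> "Either-or?"))
sc.makeRDD(docs).saveToEs("talks/slides")   // hypothetical index/type

// Read the documents back as an RDD of (id, document) pairs.
val hits = sc.esRDD("talks/slides")
println(hits.count())
```

The connector assumes the Elasticsearch node addresses are set via `es.nodes` in the Spark configuration.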

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

[Diagram: Hadoop ecosystem and Spark ecosystem components side by side]

4. Complementarity: Hadoop + Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: A healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
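In practice, "Bring Your Own Storage" shows up in the URI scheme of the path handed to calls like `sc.textFile(...)`: Spark resolves the storage backend from the scheme via the Hadoop file-system API. A minimal plain-Python sketch of that scheme-based dispatch (the helper and the scheme set are illustrative assumptions, not Spark's actual, extensible registry):

```python
from urllib.parse import urlparse

# Illustrative subset of schemes for the storage systems named above;
# 'file' stands for the local file system.
KNOWN_SCHEMES = {"hdfs", "s3n", "tachyon", "swift", "maprfs", "file"}

def storage_backend(path):
    """Return the storage backend implied by a path's URI scheme,
    defaulting to the local file system when no scheme is given.
    (Hypothetical helper mimicking scheme-based dispatch.)"""
    scheme = urlparse(path).scheme or "file"
    if scheme not in KNOWN_SCHEMES:
        raise ValueError("unsupported storage scheme: " + scheme)
    return scheme

print(storage_backend("s3n://my-bucket/logs/"))   # s3n
print(storage_backend("/tmp/local-data.txt"))     # file
```

The same Spark job can therefore move from HDFS to S3 or Tachyon by changing only the path prefix, which is the whole point of the slide above.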

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
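The deployment choices above largely boil down to the `--master` URL handed to `spark-submit` (Spark 1.x syntax). A sketch of illustrative invocations; the class name, `app.jar`, and host names are placeholders:

```shell
# 1. Local: run with 4 worker threads, no cluster at all
spark-submit --master local[4] --class com.example.App app.jar

# 2. Standalone: Spark's own cluster manager, no Hadoop required
spark-submit --master spark://master-host:7077 --class com.example.App app.jar

# 3. Apache Mesos
spark-submit --master mesos://mesos-master:5050 --class com.example.App app.jar

# For comparison, on a Hadoop cluster the same job would target YARN:
spark-submit --master yarn-cluster --class com.example.App app.jar
```

Only the master URL changes between a Hadoop and a non-Hadoop deployment; the application code is untouched.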

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3. Distributions

• Using Spark on a Non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

Hadoop Ecosystem → Spark Ecosystem

Components:

• HDFS → Tachyon

• YARN → Mesos

Tools:

• Pig → Spark native API

• Hive → Spark SQL

• Mahout → MLlib

• Storm → Spark Streaming

• Giraph → GraphX

• HUE → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

• Resource sharing: YARN: Yes; Mesos: Yes

• Written in: YARN: Java; Mesos: C++

• Scheduling: YARN: Memory only; Mesos: CPU and Memory

• Running tasks: YARN: Unix processes; Mesos: Linux Container groups

• Requests: YARN: Specific requests and locality preference; Mesos: More generic, but more coding for writing frameworks

• Maturity: YARN: Less mature; Mesos: Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
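The "mix and match SQL and imperative code" idea can be shown in miniature with Python's stdlib sqlite3 standing in for the Spark SQL engine (a sketch only; in Spark 1.x itself the declarative step would be a `sqlContext.sql(...)` call returning rows, and the table name and data here are invented):

```python
import sqlite3

# A stand-in for a table Spark SQL might expose via the Hive metastore.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative step: post-process the result rows in ordinary code.
heavy_users = [user for (user, total) in rows if total > 5]
print(heavy_users)  # ['ann', 'bob']
```

The pattern, not the library, is the point: a SQL query produces rows, and general-purpose code picks up where SQL becomes awkward.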

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

• Processing model: Storm: record at a time; Spark Streaming: mini batches

• Latency: Storm: sub-second; Spark Streaming: few seconds

• Fault tolerance (every record processed): Storm: at least once (may be duplicates); Spark Streaming: exactly once

• Batch framework integration: Storm: not available; Spark Streaming: Core Spark API

• Supported languages: Storm: any programming language; Spark Streaming: Scala, Java, Python

95
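The "mini batches" row is the key architectural difference above. A plain-Python sketch (no Spark) of grouping a record stream into mini-batches, the way Spark Streaming's DStream discretizes a stream — with the simplification, noted in the comments, that real DStreams batch by time interval rather than by record count:

```python
def mini_batches(records, batch_size):
    """Group a stream of records into fixed-size mini-batches,
    mimicking how Spark Streaming discretizes a stream into RDDs.
    (Real DStreams batch by time interval, not by count.)"""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Storm would hand each of these 7 records to a bolt one at a time;
# Spark Streaming would process them as 3 small batches.
print(list(mini_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what buys Spark Streaming its exactly-once semantics and Core Spark API reuse, at the cost of a few seconds of latency.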

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


Apache Spark Survey 2015 by Typesafe – Quick Snapshot

7

3 Vendors

8

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html

• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014

• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/

• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

9

3. Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark

• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/

• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html

• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

11

4. Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?

• Both are already happening:

• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.

• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.

Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

12

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.

• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.

• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.

2. Surveys: Listen to what Spark developers are saying.

3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4. Analysts: Thorough understanding of the market dynamics.

14

II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark

1. Big Data

2. Typical Big Data Stack

3. Apache Hadoop

4. Apache Spark

5. Key Takeaways

15

1. Big Data

• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3. Apache Hadoop

• Apache Hadoop as an example of a Typical Big Data Stack.

• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name – an incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark

• Apache Spark as an example of a Typical Big Data Stack.

• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage!

• BYOC: Bring Your Own Cluster!

• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

19

5. Key Takeaways

1. Big Data: Still one of the most inflated buzzwords.

2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4. Apache Spark: Emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1. Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

[Figure: four generations of compute models]

• 1st generation (MapReduce): batch

• 2nd generation (Tez): batch, interactive

• 3rd generation (Spark): batch, interactive, near-real time

• 4th generation (Flink): batch, interactive, real-time, iterative

1. Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see slide on the evolution of Programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.

24

1. Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

• Batch and Streaming in the same system

• Beyond DAGs (cyclic operator graphs)

• Powerful, expressive APIs

• Inside-the-system iterations

• Full Hadoop compatibility

• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

• License: MapReduce: Open Source Apache 2.0, version 2.x; Tez: Open Source Apache 2.0, version 0.x; Spark: Open Source Apache 2.0, version 1.x

• Processing model: MapReduce: on-disk (disk-based parallelization), batch; Tez: on-disk, batch, interactive; Spark: in-memory and on-disk, batch, interactive, streaming (near real-time)

• Language written in: MapReduce: Java; Tez: Java; Spark: Scala

• API: MapReduce: [Java, Python, Scala], user-facing; Tez: Java [ISV/Engine/Tool builder]; Spark: [Scala, Java, Python], user-facing

• Libraries: MapReduce: none, separate tools; Tez: none; Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

• Installation: MapReduce: bound to Hadoop; Tez: bound to Hadoop; Spark: isn't bound to Hadoop

• Ease of use: MapReduce: difficult to program, needs abstractions, no interactive mode except Hive/Pig; Tez: difficult to program, no interactive mode except Hive/Pig; Spark: easy to program, no need of abstractions, interactive mode

• Compatibility: to data types and data sources is the same for all three

• YARN integration: MapReduce: YARN application; Tez: ground-up YARN application; Spark: moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

• Deployment: MapReduce: YARN; Tez: YARN; Spark: [Standalone, YARN, SIMR, Mesos, …]

• Performance: Spark: good performance when data fits into memory; performance degradation otherwise

• Security: MapReduce: more features and projects; Tez: more features and projects; Spark: still in its infancy (partial support)

30

III. Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
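The classic word count shows what that translation looks like: a MapReduce mapper/reducer pair collapses into a short chain of RDD transformations. The sketch below mimics the chain with plain Python (no cluster needed); the helper names mirror the Spark API but are local stand-ins, not Spark calls.

```python
# Word count as the Spark transformation chain would express it:
#   lines.flatMap(split) -> map each word to (word, 1) -> reduceByKey(add)
# mimicked with plain Python so the translation from MapReduce is visible.

def flat_map(f, xs):
    # Spark's flatMap: apply f to each element, concatenating the results
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # Spark's reduceByKey: merge all values sharing a key with f
    acc = {}
    for key, value in pairs:
        acc[key] = f(acc[key], value) if key in acc else value
    return acc

lines = ["spark or hadoop", "spark and hadoop"]
words = flat_map(str.split, lines)                 # the mapper's split step
pairs = [(w, 1) for w in words]                    # the mapper emits (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)  # the reducer sums per key
print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The mapper body becomes the `flat_map`/pair step and the reducer body becomes the function passed to `reduce_by_key`, which is why existing mapper and reducer functions can often be reused directly.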

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34
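The "-x spark" switch in practice is just a different execution-engine flag on the same script (the script name below is a placeholder):

```shell
# Same Pig script, different execution engine:
pig -x mapreduce wordcount.pig   # classic MapReduce execution
pig -x spark wordcount.pig       # same script on Spark, no code changes
```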

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading on Spark (expected in the Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout on Spark (expected in Mahout 1.0)

• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41
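The co-occurrence approach behind the Mahout-on-Spark recommender referenced above can be illustrated with a toy, plain-Python sketch (no Mahout or Spark required; the data and the scoring rule are invented for illustration): items are recommended to a user when they frequently co-occur with the user's own items in other users' histories.

```python
from collections import defaultdict

def cooccurrence_recommend(interactions, user, top_n=2):
    """Toy co-occurrence recommender: score candidate items by how often
    they appear together with the target user's items across all users."""
    counts = defaultdict(int)  # (item_a, item_b) -> co-occurrence count
    for items in interactions.values():
        for a in items:
            for b in items:
                if a != b:
                    counts[(a, b)] += 1
    seen = interactions[user]
    scores = defaultdict(int)
    for a in seen:
        for (x, y), c in counts.items():
            if x == a and y not in seen:
                scores[y] += c
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical user histories (sets of items each user interacted with)
histories = {
    "u1": {"spark", "hadoop"},
    "u2": {"spark", "hadoop", "kafka"},
    "u3": {"spark", "kafka"},
    "u4": {"hadoop"},
}
print(cooccurrence_recommend(histories, "u4"))
```

In Mahout's actual Spark implementation the co-occurrence matrix is computed as a distributed matrix product with log-likelihood weighting; this sketch only shows the raw counting idea.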

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration

Services and open source tools that integrate with Spark (table of service categories and tool logos in the original slide):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50
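The PageRank computation that the Neo4j + Spark articles above revolve around can be sketched in a few lines of plain Python (a toy, single-machine version; in practice GraphX distributes exactly this iteration over a cluster, and this sketch assumes every node has at least one outgoing link, i.e. no dangling nodes):

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterative PageRank over an adjacency dict {node: [outgoing nodes]}.
    Each iteration, every node distributes its rank evenly over its
    out-links, then ranks are recombined with the damping factor."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}  # uniform start
    for _ in range(iters):
        incoming = {n: 0.0 for n in nodes}
        for src, outs in links.items():
            share = rank[src] / len(outs)  # assumes outs is non-empty
            for dst in outs:
                incoming[dst] += share
        rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                for n in nodes}
    return rank

# Tiny example graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Node `c`, which is linked from both `a` and `b`, ends up with the highest rank, and with no dangling nodes the ranks stay normalized to 1.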

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
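The schema-inference idea mentioned above can be illustrated with a toy, plain-Python sketch (no Spark required): scan JSON records and collect a field-to-type mapping. This only handles flat records and lets the last-seen type win on conflicts, whereas Spark SQL handles nesting and merges conflicting types into a common compatible type.

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping by scanning JSON records,
    roughly what Spark SQL does when loading a JSON dataset."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

# Records need not share the same fields; the schema is the union.
lines = ['{"name": "alice", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(lines))
```

The union-of-fields behavior is the key point: no record has to carry every column, which is why no up-front DDL is needed.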

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

(Diagram in the original slide: Hadoop ecosystem alongside Spark ecosystem)

64

4. Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
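The Data >> RAM vs Data << RAM rule of thumb above can be written down as a tiny decision helper. This is purely illustrative, not from the slides: the 0.6 usable-memory fraction is an invented knob (real clusters reserve memory for the OS, YARN overhead, and Spark's own execution space).

```python
def pick_engine(data_gb, cluster_ram_gb, cache_fraction=0.6):
    """Illustrative heuristic: prefer Spark's in-memory caching when the
    parsed data fits in the cluster's usable RAM, otherwise fall back to
    a more stream-oriented engine like Tez. cache_fraction is an assumed,
    made-up estimate of how much cluster RAM is usable for caching."""
    if data_gb <= cluster_ram_gb * cache_fraction:
        return "spark"
    return "tez"

print(pick_engine(data_gb=100, cluster_ram_gb=1000))   # data << RAM
print(pick_engine(data_gb=5000, cluster_ram_gb=1000))  # data >> RAM
```

In practice the choice also depends on shuffle behavior, iteration count, and latency requirements, as the slide notes; a single threshold is only a starting point.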

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

Hadoop ecosystem → Spark ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

Tachyon

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
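The functional style of the Spark native API can be mimicked in plain Python. This sketch (not Spark code; the input lines are invented) mirrors the classic RDD word-count pipeline, `lines.flatMap(split).map(w -> (w, 1)).reduceByKey(+)`, on a single machine:

```python
from collections import defaultdict

def word_count(lines):
    """Plain-Python analogue of the canonical Spark RDD word count."""
    # flatMap + map: one (word, 1) pair per word in every line
    pairs = ((word, 1) for line in lines for word in line.split())
    # reduceByKey(+): sum the 1s per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["spark or hadoop", "spark and hadoop"]))
```

In Spark the same three-step pipeline runs partitioned across a cluster, with the `reduceByKey` step triggering a shuffle; the per-record logic is identical.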

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

93

Spark Streaming

• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

94

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

95
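The "mini batches" processing model in the table above can be sketched in plain Python (a toy illustration, not Spark code): instead of handing each record to the processing logic as it arrives, records are grouped into small batches, trading a little latency for throughput and simpler batch semantics. In Spark Streaming the batch boundary is a time interval (e.g. 1 second); the count-based boundary here just keeps the sketch deterministic.

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into mini batches, the processing
    model Spark Streaming uses, vs Storm's record-at-a-time model."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # hand a full batch to the processing logic
            batch = []
    if batch:
        yield batch          # flush the final partial batch

print(list(micro_batches(range(7), 3)))
```

A record waits at most one batch interval before being processed, which is where the "few seconds" latency row in the table comes from.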

GraphX

• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

96

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

97

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

100

Page 8: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3. Vendors

8

• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• A uniform API for diverse workloads, over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

3. Vendors

• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

9

3. Vendors

• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits," October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

11

4. Analysts

• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

12

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.

2. Surveys: Listen to what Spark developers are saying.

3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4. Analysts: Thorough understanding of the market dynamics.

14

II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data

2. Typical Big Data Stack

3. Apache Hadoop

4. Apache Spark

5. Key Takeaways

15

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2. Typical Big Data Stack

17

3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.

• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.

• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage

• BYOC: Bring Your Own Cluster

• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots from BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!

19

5. Key Takeaways

1. Big Data: Still one of the most inflated buzzwords.

2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?

3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4. Apache Spark: Emergence of the Apache Spark ecosystem.

20

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

21

1. Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22
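The "assembly code" contrast is easiest to see with word count: the canonical Hadoop example linked above runs to roughly 60 lines of Java, while Spark expresses the same job as a short flatMap/map/reduceByKey chain. As a rough sketch, here is a plain-Python analogue of that chain (standard library only — not the actual Spark API):

```python
from collections import Counter

def word_count(lines):
    # flatMap: split each line into words
    words = (w for line in lines for w in line.split())
    # map each word to a count of 1, then reduceByKey: sum per word
    counts = Counter()
    for w in words:
        counts[w] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In actual Spark (Scala) the same logic is the familiar one-liner over an RDD; the point is that the higher-level APIs on this slide all exist to recover this kind of brevity on top of MapReduce.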

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

• 1st Generation: Batch

• 2nd Generation: Batch, Interactive

• 3rd Generation: Batch, Interactive, Near-Real Time

• 4th Generation: Batch, Interactive, Real-Time, Iterative

1. Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: for Queries, Streaming Analytics, Machine Learning, and Graph Analytics.

24

1. Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26
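To make that "core capability" concrete, the sketch below imitates what `rdd.cache()` buys you in iterative workloads: the expensive parse runs once, and every later action reuses the in-memory result. This is a stdlib-only Python analogue for illustration, not Spark's actual API:

```python
class CachedDataset:
    """Toy analogue of an RDD with cache(): parse once, reuse in memory."""
    def __init__(self, raw_records):
        self.raw = raw_records
        self._cache = None
        self.parse_calls = 0  # instrumentation for this example

    def _parsed(self):
        if self._cache is None:   # only the first action triggers the parse
            self.parse_calls += 1
            self._cache = [int(r) for r in self.raw]
        return self._cache

    def count(self):
        return len(self._parsed())

    def total(self):
        return sum(self._parsed())

ds = CachedDataset(["1", "2", "3"])
print(ds.count(), ds.total(), ds.parse_calls)  # -> 3 6 1
```

Without the cache, every pass of an iterative algorithm (e.g., each gradient-descent step) would re-read and re-parse the input — which is exactly the cost profile of chained MapReduce jobs.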

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

• Batch and Streaming in the same system

• Beyond DAGs (cyclic operator graphs)

• Powerful, expressive APIs

• Inside-the-system iterations

• Full Hadoop compatibility

• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark

License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x

Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)

Language written in: Java | Java | Scala

API: [Java, Python, Scala], User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], User-Facing

Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark

Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop

Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode

Compatibility: to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same

YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark

Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]

Performance: - | - | Good performance when data fits into memory; performance degradation otherwise

Security: More features and projects | More features and projects | Still in its infancy (partial support)

30

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
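Point 1 can be sketched in plain Python: the legacy mapper and reducer functions stay untouched, and a new driver feeds them through a shuffle-like group-by-key, the way a Spark job could call the same functions from Java or Scala. (Hypothetical functions, standard library only — an illustration of the pattern, not Spark's actual API.)

```python
from itertools import groupby

# Legacy MapReduce-style functions, reused as-is.
def mapper(record):
    user, item = record.split(":")
    yield (user, 1)          # emit (key, value) pairs, MapReduce-style

def reducer(key, values):
    return key, sum(values)  # aggregate all values for one key

def run_on_new_engine(records):
    """Shuffle-like driver: sort pairs by key, group them, then reduce."""
    pairs = sorted(kv for r in records for kv in mapper(r))
    return dict(reducer(k, [v for _, v in g])
                for k, g in groupby(pairs, key=lambda kv: kv[0]))

print(run_on_new_engine(["ann:book", "bob:dvd", "ann:pen"]))
# -> {'ann': 2, 'bob': 1}
```

Because the mapper and reducer never reference the execution engine, the same pair of functions can be driven by MapReduce today and by a Spark pipeline tomorrow.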

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (Expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

42

3. Integration: Service ↔ Open Source Tool

• Storage/Serving Layer

• Data Formats

• Data Ingestion Services

• Resource Management

• Search

• SQL

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).

• Integration is still improving, and some issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables

• Run SQL queries over imported data

• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming — Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
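Spark Streaming consumes a Kafka topic as a sequence of small batches (a DStream) and can keep running state across them. The micro-batch idea itself can be shown with standard-library Python; this is an analogue of a stateful streaming word count, not the actual Spark/Kafka API:

```python
from collections import Counter

def process_stream(batches):
    """Fold micro-batches into running word counts, yielding the
    state after each batch (roughly what updateStateByKey maintains)."""
    state = Counter()
    for batch in batches:
        state.update(w for line in batch for w in line.split())
        yield dict(state)

# Two micro-batches arriving from the (simulated) topic:
for snapshot in process_stream([["spark kafka"], ["kafka kafka"]]):
    print(snapshot)
# -> {'spark': 1, 'kafka': 1}
# -> {'spark': 1, 'kafka': 3}
```

In real Spark Streaming, the batch interval and the Kafka receivers replace the plain list of batches, but the processing logic per batch has this same shape.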

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style Push-based Approach

• Approach 2 (Experimental): Pull-based Approach using a Custom Sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL — just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
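The inference step above can be illustrated with a toy version in standard-library Python: walk the JSON records, union the fields seen across them, and note a type per field. (Spark SQL's real inference also merges nested structures and resolves type conflicts; this sketch only shows the idea.)

```python
import json

def infer_schema(json_lines):
    """Union the fields across records and record each field's type —
    a simplified picture of Spark SQL's JSON schema inference."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # keep the first type observed for each field
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "alice", "age": 30}',
           '{"name": "bob", "city": "LA"}']
print(infer_schema(records))
# -> {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that the resulting schema is the union of all fields, even though no single record contains all three — the same behavior the blog post describes for heterogeneous JSON datasets.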

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files

• Run SQL queries over imported data

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

• Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark

• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

Hadoop ecosystem + Spark ecosystem

64

4. Complementarity: Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine — an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform; data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

82

83

• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39

84

• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS"
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

90

Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
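Spark SQL itself runs on a Spark cluster, but the "mix SQL with imperative code" pattern described above can be sketched with nothing beyond the Python standard library's sqlite3 module. The table name and data below are invented for illustration; only the pattern (declarative query, then imperative post-processing) carries over:

```python
import sqlite3

# In-memory database standing in for a Spark SQL table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code.
top_users = [user for user, total in rows if total > 4]
print(top_users)  # ['ann', 'bob']
```

In Spark SQL the same two steps would be a query against a SchemaRDD followed by ordinary RDD transformations on the result.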

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
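The processing-model row is the key difference: Storm hands each record to your code as it arrives, while Spark Streaming discretizes the stream into small batches and runs the same logic once per batch. A standard-library-only sketch of the mini-batch idea (batch size and records are invented; this is not Spark's implementation):

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator of records into small batches,
    the way Spark Streaming discretizes a stream into one RDD per interval."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Record-at-a-time (Storm-style) would call process() once per record;
# mini-batch (Spark Streaming-style) runs the same logic once per batch.
records = ["a", "b", "c", "d", "e"]
batches = list(mini_batches(records, 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Batching is what gives Spark Streaming its few-seconds latency floor, and also what lets it reuse the core Spark batch API unchanged.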

95

GraphX

96

'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx

Notebook

97

• Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another

99

V More Q&A

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbaltagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30 2013 httpvisionclouderacommapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data" httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml

9

3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance" httpswwwmaprcomproductsapache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing" httphortonworkscomhadoopspark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21 2014 httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits", October 30 2014 httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration

11

4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25 2015 httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum

12

4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it" Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26 2014 httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly
2 Surveys: Listen to what Spark developers are saying
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized
4 Analysts: Thorough understanding of the market dynamics

14

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze and understand" Amir H Payberah, Swedish Institute of Computer Science (SICS)

2 Typical Big Data Stack

17

3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset httpbigdataandreamostosiname - an incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0

18

4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial ones. I'm compiling a list; stay tuned

19

5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: Emergence of the Apache Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml

22

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

(Figure: four generations of compute models)
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real time
• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch; scalability; abstractions (see the slide on the evolution of Programming APIs); User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• The need remains to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics

1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

25

1 Evolution
• 'Spark' for lightning fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing" httpssparkapacheorg
• Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark

26

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory and On-Disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
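The claim that existing mapper and reducer functions can often be reused is easy to picture. In the sketch below, the classic word-count mapper and reducer are plain functions left untouched, and a Spark-style driver (flatMap then reduceByKey, simulated here with the standard library so no Spark installation is needed) calls them unchanged:

```python
from collections import defaultdict

# Unchanged "MapReduce-era" functions: a mapper emitting (word, 1) pairs
# and a reducer summing the counts for one key.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

# Spark-style driver, simulated with the standard library:
# flatMap(mapper) followed by reduceByKey(reducer).
def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:                   # flatMap step
        for key, value in mapper(line):
            grouped[key].append(value)
    out = {}
    for key, values in grouped.items():  # reduceByKey step
        total = values[0]
        for v in values[1:]:
            total = reducer(total, v)
        out[key] = total
    return out

print(word_count(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark the driver lines collapse to `lines.flatMap(mapper).reduceByKey(reducer)`; the mapper and reducer themselves need no rewrite.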

32

2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

37

Cascading (expected in the 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

39

Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

40

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration
(Figure: integration services and example open source tools, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)

43

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTestscala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

45

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

47

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

48

3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original Resource Negotiator)
• Integration is still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

52

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration
• Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
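The "automatically infer the schema" step can be sketched with the standard library alone: scan newline-delimited JSON records and collect each top-level field's name and observed type. This is a toy version of what Spark SQL does (the field names below are invented, and real inference also merges conflicting types and handles nesting):

```python
import json

def infer_schema(json_lines):
    """Walk newline-delimited JSON records and record the type seen for each
    top-level field -- a toy version of Spark SQL's schema inference."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that the "city" field only appears in the second record but still ends up in the schema, which is why no up-front DDL is needed.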

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
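A round trip through Parquet with the Spark 1.2-era SchemaRDD API might look like this (a sketch; file paths are placeholders, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

// Write a SchemaRDD out to Parquet, preserving its schema
val people = sqlContext.jsonFile("people.json")
people.saveAsParquetFile("people.parquet")

// Read the Parquet data back and query it with SQL
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect()
```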

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
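With the spark-avro library on the classpath, querying Avro data follows the same pattern as JSON and Parquet (a sketch against the early spark-avro API; the jar version, file name, and column name are placeholders):

```scala
// Launch with the library attached, e.g.:
//   spark-shell --jars spark-avro_2.10-0.1.jar
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds avroFile(...) to SQLContext

val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

// Load an Avro file as a SchemaRDD; the Avro schema becomes the table schema
val episodes = sqlContext.avroFile("episodes.avro")
episodes.registerTempTable("episodes")
sqlContext.sql("SELECT title FROM episodes").collect()
```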

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
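The elasticsearch-hadoop RDD integration mentioned above can be sketched as follows (assumptions: the `org.elasticsearch.spark` import from elasticsearch-hadoop, a local Elasticsearch node, and an illustrative `spark/docs` index):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs(...) to RDDs and esRDD(...) to SparkContext

val conf = new SparkConf()
  .setAppName("EsDemo")
  .set("es.nodes", "localhost")  // where Elasticsearch is running
val sc = new SparkContext(conf)

// Any RDD whose records can be translated into documents can be saved
val numbers = Map("one" -> 1, "two" -> 2)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

// An index can also be read back as an RDD of (id, document) pairs
val docs = sc.esRDD("spark/docs")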

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack" (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References:
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
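Reading directly from S3, for instance, needs nothing beyond a path and credentials (a sketch for the s3n:// connector of the Spark 1.x era; the bucket, path, and environment-variable names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Logs"))

// Credentials are read from the environment here -- no HDFS involved
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// An S3 path works anywhere an HDFS path would
val logs = sc.textFile("s3n://my-bucket/logs/*.log")
println(logs.count())
```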

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
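What makes these deployments interchangeable is that the cluster manager is selected by the master URL alone; the application code does not change. A minimal sketch (the host names and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only the master URL differs between deployments:
val conf = new SparkConf().setAppName("Portable")
  .setMaster("local[4]")                        // 1. local mode, 4 cores
//.setMaster("spark://master:7077")             // 2. standalone cluster
//.setMaster("mesos://zk://zkhost:2181/mesos")  // 3. Mesos
//.setMaster("yarn-client")                     // YARN (Spark with Hadoop)

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())  // same job on any cluster manager
```

In practice the master is usually passed via `spark-submit --master ...` rather than hard-coded, which keeps the application binary identical across all of the deployments listed above.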

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• "Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop ecosystem   Spark ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                      Mesos
Resource sharing   Yes                       Yes
Written in         Java                      C++
Scheduling         Memory only               CPU and memory
Running tasks      Unix processes            Linux container groups
Requests           Specific requests and     More generic, but more coding
                   locality preference       for writing frameworks
Maturity           Less mature               Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, getting code nearly as simple as the Scala API
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
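The conciseness of the native API is easiest to see on the canonical word count (a sketch; input and output paths are placeholders, and `sc` is the SparkContext provided by the spark-shell):

```scala
// Word count in the native Scala API -- the whole job in five transformations
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))     // tokenize each line
  .map(word => (word, 1))    // emit (word, 1) pairs
  .reduceByKey(_ + _)        // sum counts per word

counts.saveAsTextFile("counts")
```

The Java 8 version with lambda expressions is nearly as short, which is the point made in the bullet above.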

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
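Spark Streaming's mini-batch model can be illustrated with a network word count (a minimal sketch; the host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each DStream operation below runs on successive 1-second batches of input
val conf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```

Note how the transformations mirror the batch RDD API; this shared API is the "batch framework integration" advantage listed in the comparison table that follows.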

Storm vs. Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may be       Exactly once
record processed)             duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• "MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop": https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop

10

3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• "Hortonworks: A shared vision for Apache Spark on Hadoop", October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits", October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

11

4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.
• With uncertainty about "what is Hadoop," there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: "Hadoop Questions from Recent Webinar Span Spectrum", February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum

12

4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: "Elephants, Pigs, Rhinos and Giraphs, Oh My! It's Time To Get A Handle On Hadoop", posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoop
• Apache Hadoop, as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset (http://bigdata.andreamostosi.name): an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future", Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark
• Apache Spark, as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!

19

5 Key Takeaways

1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.

23

• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real-Time
• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), user-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark': for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

26

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                  Tez                         Spark
License              Open Source Apache 2.0,    Open Source Apache 2.0,     Open Source Apache 2.0,
                     version 2.x                version 0.x                 version 1.x
Processing model     On-disk (disk-based        On-disk; Batch,             In-memory and on-disk;
                     parallelization); Batch    Interactive                 Batch, Interactive,
                                                                            Streaming (near real-time)
Language written in  Java                       Java                        Scala
API                  [Java, Python, Scala];     Java; [ISV/Engine/Tool      [Scala, Java, Python];
                     user-facing                builder]                    user-facing
Libraries            None; separate tools       None                        [Spark Core, Spark
                                                                            Streaming, Spark SQL,
                                                                            MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                   Tez                         Spark
Installation      Bound to Hadoop             Bound to Hadoop             Isn't bound to Hadoop
Ease of use       Difficult to program,       Difficult to program;       Easy to program, no need
                  needs abstractions; no      no interactive mode         for abstractions;
                  interactive mode (except    (except Hive, Pig)          interactive mode
                  Hive, Pig)
Compatibility     Same data types and         Same data types and         Same data types and
                  data sources                data sources                data sources
YARN integration  YARN application            Ground-up YARN              Spark is moving towards
                                              application                 YARN

Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce              Tez                    Spark
Deployment   YARN                   YARN                   [Standalone, YARN, SIMR, Mesos, …]
Performance  -                      -                      Good performance when data fits into
                                                           memory; performance degradation
                                                           otherwise
Security     More features and      More features and      Still in its infancy (partial support)
             projects               projects

90

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
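The mapper/reducer reuse in point 1 can be sketched on the canonical word count: the map-side and reduce-side logic stay as plain functions, and only the job wiring changes (a sketch; paths are placeholders, and `sc` is an existing SparkContext):

```scala
// Map-side logic kept as a plain function, as in the original MapReduce mapper
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

// Reduce-side logic kept as a plain function, as in the original reducer
def reducer(a: Int, b: Int): Int = a + b

// The MapReduce job driver collapses to two RDD transformations
val counts = sc.textFile("hdfs:///input")
  .flatMap(mapper)
  .reduceByKey(reducer)

counts.saveAsTextFile("hdfs:///output")
```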

32

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application
development platform for building data applications on
Hadoop
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the
transition from Cascading/Scalding to Spark a little
easier, by adding support for Cascading Taps, Scalding
Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

Apache Crunch

• The Apache Crunch Java library provides a
framework for writing, testing and running
MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a
SparkPipeline class, making it easy to migrate
data processing applications from MapReduce
to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Apache Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - "Goodbye MapReduce":
Apache Mahout, the original Machine Learning (ML)
library for Hadoop since 2009, is rejecting new
MapReduce algorithm
implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed
machine learning on Spark: programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for
the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Apache Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov,
April 2014:
http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with
Mahout, Scala and Spark, published on May 30, 2014:
http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased):
MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

3. Integration

[Diagram: Hadoop ecosystem services mapped to open source tools - storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

3. Integration

• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available
for reading from and writing to HBase without the need
for the Hadoop API: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with
Spark. Status: still in experimentation, and no timetable for
possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
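The newAPIHadoopRDD route mentioned above can be sketched as follows. This is a minimal sketch modeled on the HBaseTest.scala example; the table name is a placeholder, and the Spark 1.x / HBase client APIs on the classpath are assumptions.

```scala
// Sketch: reading an HBase table as a Spark RDD via newAPIHadoopRDD.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseCount"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table

    // Each record is a (row key, Result) pair produced by TableInputFormat
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"Rows in my_table: ${rdd.count()}")
    sc.stop()
  }
}
```

Because this goes through the standard Hadoop InputFormat machinery, the same pattern works for any Hadoop-supported store, not just HBase.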

3. Integration

• Spark Cassandra Connector: this library lets you
expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL
queries in your Spark applications. It also supports
integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
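What "expose Cassandra tables as Spark RDDs" looks like in practice can be sketched with the DataStax connector's API. The keyspace/table names and the connector's 1.x-era API are assumptions, not examples from the talk.

```scala
// Sketch: Spark + Cassandra via the DataStax spark-cassandra-connector.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD ...
    val words = sc.cassandraTable("test", "words")
    println(words.count())

    // ... and write an RDD back to a Cassandra table
    sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
      .saveToCassandra("test", "words", SomeColumns("word", "count"))

    sc.stop()
  }
}
```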

3. Integration

• Benchmark of Spark and Cassandra integration
using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume
data from Cassandra into Spark, and to store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new
avenues
• Kindling: An Introduction to Spark with Cassandra
(Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

3. Integration

• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-
Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its
support for reading and writing JSON text files

3. Integration

• There is also NSMC, a Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without
Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph
database
• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015:
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark,
by Kenny Bastani, January 19, 2015:
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014:
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and
for fetching datasets for machine learning algorithms
in MLlib
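The Hive-table support above can be sketched with the Spark 1.2-era HiveContext API. The "src" table is the standard Hive example table; treat all names here as placeholders.

```scala
// Sketch: querying Hive tables from Spark SQL with HiveContext.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
    val hiveContext = new HiveContext(sc)

    // Run HiveQL directly against tables registered in the Hive metastore
    val rows = hiveContext.sql("SELECT key, value FROM src LIMIT 10")
    rows.collect().foreach(println)
    sc.stop()
  }
}
```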

3. Integration

• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to
address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query
in-memory data in Spark; embed Drill execution in a
Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

3. Integration

• Apache Kafka is a high-throughput distributed
messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
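The native Kafka integration can be sketched with the receiver-based KafkaUtils.createStream from the Spark 1.x integration guide. The ZooKeeper address, consumer group and topic name are placeholders.

```scala
// Sketch: consuming a Kafka topic with Spark Streaming (Spark 1.x API).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

    // Map of (topic -> number of consumer threads); values are the messages
    val lines = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1)).map(_._2)

    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```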

3. Integration

• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based
approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration

• Spark SQL provides built-in support for JSON that
vastly simplifies the end-to-end experience of
working with JSON data
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL; just point Spark
SQL at JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
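The "no DDL, just point and query" workflow can be sketched with the Spark 1.2-era SchemaRDD API. The file path and field names are placeholders.

```scala
// Sketch: Spark SQL's automatic JSON schema inference.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonQuery"))
    val sqlContext = new SQLContext(sc)

    // Schema is inferred automatically from the JSON records - no DDL
    val people = sqlContext.jsonFile("people.json")
    people.printSchema()
    people.registerTempTable("people")

    sqlContext.sql("SELECT name FROM people WHERE age > 20")
      .collect().foreach(println)
    sc.stop()
  }
}
```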

3. Integration

• Apache Parquet is a columnar storage format
available to any project in the Hadoop ecosystem,
regardless of the choice of data processing
framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of
Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
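The Parquet round trip described above can be sketched as follows, again using the Spark 1.2-era API; file names are placeholders.

```scala
// Sketch: reading and writing Parquet with Spark SQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
    val sqlContext = new SQLContext(sc)

    // Write a SchemaRDD out as Parquet ...
    val people = sqlContext.jsonFile("people.json")
    people.saveAsParquetFile("people.parquet")

    // ... and read it back; the schema is preserved in the Parquet file
    val parquetPeople = sqlContext.parquetFile("people.parquet")
    parquetPeople.registerTempTable("parquetPeople")
    sqlContext.sql("SELECT name FROM parquetPeople").collect().foreach(println)
    sc.stop()
  }
}
```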

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16
release, so Spark jobs can read from and write to Kite
datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration

• Elasticsearch is a real-time distributed search and analytics
engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch, as long as its content can be translated into
documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
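The "any RDD can be saved to Elasticsearch" point can be sketched with elasticsearch-hadoop's native Spark support. The node address and the "spark/docs" index/type name are placeholders.

```scala
// Sketch: writing an RDD to (and reading it back from) Elasticsearch.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / esRDD to RDDs and contexts

object EsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsExample")
      .set("es.nodes", "localhost:9200") // placeholder cluster address
    val sc = new SparkContext(conf)

    // Any RDD whose content can be translated into documents can be saved
    val docs = Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop"))
    sc.makeRDD(docs).saveToEs("spark/docs")

    // Reading back yields an RDD of (document id, document) pairs
    val fromEs = sc.esRDD("spark/docs")
    println(fromEs.count())
    sc.stop()
  }
}
```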

3. Integration

• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark"
• A Solr-on-Spark solution using Apache Solr, Spark,
Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration

• HUE is the open source Apache Hadoop web UI
that lets users use Hadoop directly from their
browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark
Igniter, lets users execute and monitor Spark jobs
directly from their browser and be more
productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.

[Diagram: Hadoop ecosystem alongside Spark ecosystem]

4. Complementarity: Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: YARN + Mesos

References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get
Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the
need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters)
• Tez supports enterprise security

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented", has a more mature
shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in
memory, it can be much better when we process
data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine
(interview on November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer)
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop/

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
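In code, this cluster-manager agnosticism boils down to the master URL: the same application can target local threads, a standalone cluster, Mesos or YARN without changes. A minimal sketch (hostnames are placeholders; the master-URL forms follow Spark 1.x conventions):

```scala
// Sketch: the same Spark app under different cluster managers.
import org.apache.spark.{SparkConf, SparkContext}

object AnyClusterApp {
  def main(args: Array[String]): Unit = {
    // Pick one master URL; in practice this is usually passed to spark-submit:
    //   local[4]                    - 4 worker threads on one machine
    //   spark://master:7077         - standalone cluster
    //   mesos://master:5050         - Apache Mesos
    //   yarn-cluster / yarn-client  - Hadoop YARN (Spark 1.x syntax)
    val conf = new SparkConf().setAppName("AnyClusterApp").setMaster("local[4]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}
```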

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

3. Distributions

• Using Spark on a non-Hadoop distribution

Databricks Cloud

• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015:
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at
Spark Summit 2014, July 2, 2014:
https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE: DataStax Enterprise

• DSE, DataStax Enterprise, built on Apache Cassandra,
presents itself as a non-Hadoop Big Data platform.
Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014:
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published on November 24, 2014:
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

xPatterns

• xPatterns (http://atigeo.com/technology/) is a complete big
data analytics platform, available with a novel
architecture that integrates components across
three logical layers: Infrastructure, Analytics
and Applications
• xPatterns is cloud-based, exceedingly scalable,
and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

BlueData

• The BlueData (http://www.bluedata.com) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
• With EPIC software, you can spin up Hadoop
clusters, with the data and analytical tools that
your data scientists need, in minutes rather than
months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes
streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially
compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

             Hadoop ecosystem    Spark ecosystem
Components:
             HDFS                Tachyon
             YARN                Mesos
Tools:
             Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

Tachyon

• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark
and MapReduce programs can run on top of it
without any code change
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

Mesos

• Mesos (http://mesos.apache.org) enables fine-
grained sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing
apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs Mesos

Criteria           YARN                        Mesos
Resource sharing   Yes                         Yes
Written in         Java                        C++
Scheduling         Memory only                 CPU and memory
Running tasks      Unix processes              Linux container groups
Requests           Specific requests and       More generic, but more
                   locality preference         coding for writing frameworks
Maturity           Less mature                 Relatively more mature

Spark Native API

• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8's much more concise
lambda expressions, to get code nearly as
simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
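The conciseness of the native Scala API is easiest to see in the classic word count, the kind of code the deck contrasts with its MapReduce equivalent. A minimal sketch; the input path is a placeholder:

```scala
// Sketch: word count in Spark's native Scala API.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input.txt") // placeholder path
      .flatMap(_.split("\\s+"))                   // split lines into words
      .map(word => (word, 1))                     // pair each word with 1
      .reduceByKey(_ + _)                         // sum counts per word
    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

The same pipeline is a few dozen lines of boilerplate as a MapReduce job, which is the point the slide is making.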

Spark SQL

• Spark SQL is a new SQL engine designed from the
ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide schema, such as JSON, Parquet, Hive or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                      Storm                     Spark Streaming
Processing model              Record at a time          Mini batches
Latency                       Sub-second                Few seconds
Fault tolerance (every        At least once (may        Exactly once
record processed)             be duplicates)
Batch framework integration   Not available             Core Spark API
Supported languages           Any programming           Scala, Java,
                              language                  Python

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 11: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3. Vendors

• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

4. Analysts

• Is Apache Spark replacing Hadoop, or complementing
existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no
reason to think solution stacks built on Spark, not
positioned as Hadoop, will not continue to proliferate
as the technology matures
• At the same time, Hadoop distributions are all
embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum,
February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.

14

II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

15

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2. Typical Big Data Stack

17

3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name Incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!

19

5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.

20

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

21

1. Evolution of Programming APIs
• MapReduce in Java is like assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: a Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.

23

• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.

24

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

• License: Open Source, Apache 2.0, for all three (MapReduce at version 2.x, Tez at version 0.x, Spark at version 1.x).
• Processing Model: MapReduce: On-Disk (disk-based parallelization), Batch. Tez: On-Disk, Batch, Interactive. Spark: In-Memory and On-Disk; Batch, Interactive, Streaming (Near Real-Time).
• Language written in: MapReduce: Java. Tez: Java. Spark: Scala.
• API: MapReduce: [Java, Python, Scala], user-facing. Tez: Java [ISV/Engine/Tool builder]. Spark: [Scala, Java, Python], user-facing.
• Libraries: MapReduce: none, separate tools. Tez: none. Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX].

28

Hadoop MapReduce vs. Tez vs. Spark

• Installation: MapReduce: bound to Hadoop. Tez: bound to Hadoop. Spark: isn't bound to Hadoop.
• Ease of Use: MapReduce: difficult to program, needs abstractions; no interactive mode except Hive, Pig. Tez: difficult to program; no interactive mode except Hive, Pig. Spark: easy to program, no need of abstractions; interactive mode.
• Compatibility: compatibility to data types and data sources is the same for all three.
• YARN integration: MapReduce: YARN application. Tez: ground-up YARN application. Spark: moving towards YARN.

29

Hadoop MapReduce vs. Tez vs. Spark

• Deployment: MapReduce: YARN. Tez: YARN. Spark: [Standalone, YARN, SIMR, Mesos, …].
• Performance: Spark: good performance when data fits into memory; performance degradation otherwise.
• Security: MapReduce: more features and projects. Tez: more features and projects. Spark: still in its infancy (partial support).

30

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

31

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
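To make the migration concrete, here is a hypothetical sketch (not from the talk) of the classic WordCount, which needs a full Mapper/Reducer pair in Hadoop MapReduce but collapses to a few transformations on Spark Core. The input/output paths are placeholders, and the code assumes the Spark 1.x API on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/books") // same InputFormat-backed read as MR
      .flatMap(line => line.split("\\s+"))          // roughly the map() phase
      .map(word => (word, 1))
      .reduceByKey(_ + _)                           // roughly the reduce() phase
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

Note that `flatMap`/`map`/`reduceByKey` are ordinary RDD transformations, which is why existing mapper and reducer function bodies can often be called from inside them unchanged.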

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• Sqoop 2 Proposal is still under discussion https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration
[Diagram: Hadoop services and example open source tools, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory http://hortonworks.com/blog/ddm, to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851

44
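As a hypothetical illustration of this storage-agnostic design (not from the talk), the same `SparkContext` API reads from any Hadoop-supported storage; only the URI scheme changes. Host names, bucket names, and paths below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StorageAgnostic {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StorageAgnostic"))
    // Same call, three different storage systems:
    val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
    val fromLocal = sc.textFile("file:///tmp/events.log")
    val fromS3    = sc.textFile("s3n://my-bucket/events.log") // s3n scheme in the Spark 1.x era
    println(fromHdfs.count() + fromLocal.count() + fromS3.count())
    sc.stop()
  }
}
```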

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra

46
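A hypothetical sketch of the spark-cassandra-connector usage pattern described above: keyspace, table, and host names are placeholders, and the code assumes the `com.datastax.spark:spark-cassandra-connector` artifact (Spark 1.x era) on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra to sc and RDDs

object CassandraIntegration {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraIntegration")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as a Spark RDD...
    val users = sc.cassandraTable("my_keyspace", "users")
    println(users.count())

    // ...and write an RDD back to another table.
    sc.parallelize(Seq(("alice", 30), ("bob", 25)))
      .saveToCassandra("my_keyspace", "user_ages", SomeColumns("name", "age"))
    sc.stop()
  }
}
```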

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
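A hypothetical sketch of this Hive support in the Spark 1.2-era API, where `HiveContext` runs plain HiveQL against existing Hive tables. The database/table and column names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveIntegration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveIntegration"))
    val hiveCtx = new HiveContext(sc)

    // Query an existing Hive table; the result is a SchemaRDD,
    // which can also feed MLlib algorithms.
    val recent = hiveCtx.sql("SELECT user_id, amount FROM sales WHERE year = 2015")
    recent.registerTempTable("recent_sales")
    hiveCtx.sql("SELECT user_id, SUM(amount) FROM recent_sales GROUP BY user_id")
      .collect()
      .foreach(println)
    sc.stop()
  }
}
```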

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka

54
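A hypothetical sketch of the receiver-based API from the Spark 1.x Kafka integration guide. ZooKeeper address, consumer group, and topic names are placeholders; the `spark-streaming-kafka` artifact is assumed on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Arguments: StreamingContext, ZooKeeper quorum, consumer group,
    // and a map of topic -> number of receiver threads.
    val lines = KafkaUtils.createStream(
      ssc, "zookeeper:2181", "demo-group", Map("events" -> 1)).map(_._2)

    // Running word count over each micro-batch.
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```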

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
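A hypothetical sketch of this schema inference in the Spark 1.2-era API, where `jsonFile` returns a SchemaRDD. The file path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSQL {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonSQL"))
    val sqlCtx = new SQLContext(sc)

    // The schema is inferred automatically from the JSON records -- no DDL needed.
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")
    people.printSchema()

    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)
    sc.stop()
  }
}
```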

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet

57

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem + Spark ecosystem

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …

75

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature

90

Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                 Storm                    Spark Streaming
Processing model         Record at a time         Mini batches
Latency                  Sub-second               Few seconds
Fault tolerance (every   At least once (may be    Exactly once
record processed)        duplicates)
Batch framework          Not available            Core Spark API
integration
Supported languages      Any programming          Scala, Java, Python
                         language

95
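The "mini batches" row is the key difference: Spark Streaming discretizes a stream into fixed-width batches rather than processing one record at a time. The discretization idea can be sketched in plain Python (a conceptual illustration only, not Spark Streaming's API):

```python
def micro_batches(records, batch_ms, get_ts):
    """Group a timestamped record stream into fixed-width micro-batches,
    the way Spark Streaming discretizes a stream (conceptual sketch)."""
    batches = {}
    for r in records:
        window = get_ts(r) // batch_ms  # which batch interval this record falls in
        batches.setdefault(window, []).append(r)
    return [batches[w] for w in sorted(batches)]

# (timestamp_ms, payload) events; a 1000 ms batch interval yields 3 batches
events = [(0, "a"), (120, "b"), (950, "c"), (1010, "d"), (2500, "e")]
out = micro_batches(events, 1000, lambda r: r[0])
```

A record-at-a-time engine like Storm would invoke processing five times here; the micro-batch model invokes it three times, once per batch, trading latency for throughput.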

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
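GraphX's flagship example is PageRank. The core iteration fits in a few lines of plain Python (a conceptual sketch over an adjacency dict; GraphX itself distributes the same computation over RDDs):

```python
def pagerank(links, iters=20, d=0.85):
    """Iterative PageRank on an adjacency dict {node: [out-links]}.
    Each node's rank is split among its out-links every iteration."""
    nodes = set(links) | {m for outs in links.values() for m in outs}
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        contrib = {v: 0.0 for v in nodes}
        for v, outs in links.items():
            if outs:
                share = ranks[v] / len(outs)
                for m in outs:
                    contrib[m] += share
        # damping: random-jump probability plus link-following probability
        ranks = {v: (1 - d) / n + d * contrib[v] for v in nodes}
    return ranks

# c is linked to by both a and b, so it ends up ranked highest
r = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```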

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring Your Own Storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum

12

4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5 Key Takeaways

1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.

14

II Big Data, Typical Big Data Stack, Hadoop, Spark

1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways

15

1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on "Hadoop Isn't Just Hadoop Anymore" for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

19

5 Key Takeaways

1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

[Figure: workload types supported by each engine generation]
• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-real-time
• 4th generation (Flink): Batch, Interactive, Real-time, Iterative

1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25
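The directed-acyclic-graph execution Tez describes can be illustrated with a toy topological scheduler (a conceptual sketch in plain Python, not Tez's API; the task names are made up):

```python
from collections import deque

def run_dag(tasks, deps):
    """Run tasks in dependency order using Kahn's algorithm.

    tasks: name -> zero-arg callable producing that task's result
    deps:  name -> list of upstream task names it depends on
    """
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    downstream = {t: [] for t in tasks}
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order, results = [], {}
    while ready:
        t = ready.popleft()
        order.append(t)
        results[t] = tasks[t]()          # run a task once all its inputs exist
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: not a DAG")
    return order, results

# A diamond-shaped plan: read -> (filter, project) -> join
tasks = {
    "read":    lambda: list(range(6)),
    "filter":  lambda: "evens",
    "project": lambda: "col_a",
    "join":    lambda: "joined",
}
deps = {"filter": ["read"], "project": ["read"], "join": ["filter", "project"]}
order, _ = run_dag(tasks, deps)
```

The point of the DAG model is exactly this: the whole multi-stage plan is scheduled as one job, instead of chaining separate MapReduce jobs through HDFS between stages.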

1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26
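The RDD "core capability" rests on lazily evaluated transformations: nothing runs until an action like collect() is called, which lets the engine see, and optimize, the whole chain. That behavior can be imitated in miniature (a toy single-machine sketch; MiniRDD is an illustrative name, not Spark's API):

```python
class MiniRDD:
    """Toy imitation of Spark's RDD: transformations are only recorded;
    all computation is deferred until the collect() action."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, f):                       # transformation: lazy
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):                    # transformation: lazy
        return MiniRDD(self._data, self._ops + [("filter", p)])

    def collect(self):                      # action: runs the whole chain
        out = list(self._data)
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# No work has happened yet; collect() triggers the whole pipeline at once:
result = rdd.collect()
```

In real Spark the same deferred chain is partitioned across a cluster and can be cached in memory between actions, which is what makes iterative workloads fast.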

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce               Tez                    Spark
License           Open Source Apache      Open Source Apache     Open Source Apache
                  2.0, version 2.x        2.0, version 0.x       2.0, version 1.x
Processing model  On-disk (disk-based     On-disk; Batch,        In-memory and on-disk;
                  parallelization);       Interactive            Batch, Interactive,
                  Batch                                          Streaming (near real-time)
Language written  Java                    Java                   Scala
in
API               [Java, Python, Scala];  Java [ISV/Engine/      [Scala, Java, Python];
                  user-facing             Tool builder]          user-facing
Libraries         None; separate tools    None                   [Spark Core, Spark
                                                                 Streaming, Spark SQL,
                                                                 MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce               Tez                    Spark
Installation      Bound to Hadoop         Bound to Hadoop        Isn't bound to Hadoop
Ease of use       Difficult to program;   Difficult to program;  Easy to program; no
                  needs abstractions; no  no interactive mode    need of abstractions;
                  interactive mode        except Hive, Pig       interactive mode
                  except Hive, Pig
Compatibility     Same for data types     Same                   Same
                  and data sources
YARN integration  YARN application        Ground-up YARN         Spark is moving
                                          application            towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce              Tez                   Spark
Deployment   YARN                   YARN                  [Standalone, YARN,
                                                          SIMR, Mesos, …]
Performance  -                      -                     Good performance when
                                                          data fits into memory;
                                                          performance degradation
                                                          otherwise
Security     More features and      More features and     Still in its infancy
             projects               projects              (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
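Point 1, reusing mapper and reducer functions, can be sketched in plain Python: the same two functions a Hadoop word-count job would use are driven here by ordinary Python, with comments noting the equivalent Spark calls (a conceptual sketch, not PySpark itself):

```python
from functools import reduce
from itertools import groupby

# Classic Hadoop-style word-count mapper and reducer, written as plain
# functions. In Spark, the very same functions could be reused as, e.g.,
#   rdd.flatMap(mapper).reduceByKey(reducer)
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

lines = ["spark and hadoop", "spark or hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]         # ~ flatMap
pairs.sort(key=lambda kv: kv[0])                              # ~ shuffle/sort
counts = {k: reduce(reducer, (v for _, v in g))
          for k, g in groupby(pairs, key=lambda kv: kv[0])}   # ~ reduceByKey
```

The migration cost is low precisely because the map and reduce logic is ordinary code; only the driver that wires it together changes.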

2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Diagram: integration points between the Hadoop and Spark stacks, by service layer, with the open source tools at each layer: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

43

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1, introduction and setup: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2, Hive example: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3, Spark example and key takeaways: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
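A Kafka concept worth knowing for this integration is key-based partitioning: all records with the same key land on the same partition, in order, which is what lets a Spark Streaming consumer process each key's events sequentially. A plain-Python sketch of the idea (illustrative only; Kafka's default partitioner hashes with murmur2, crc32 is used here just for determinism):

```python
import zlib

def partition_for(key, num_partitions):
    """Pick a partition by hashing the record key, so that all records
    with the same key are kept together and in order on one partition."""
    return zlib.crc32(key.encode()) % num_partitions

# (key, value) records of a hypothetical topic, spread over 4 partitions
topic = [("user-1", "click"), ("user-2", "view"), ("user-1", "buy")]
parts = {}
for key, value in topic:
    parts.setdefault(partition_for(key, 4), []).append((key, value))
# every "user-1" record is now on one partition, with "click" before "buy"
```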

3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
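What "automatically infer the schema" means can be shown with a toy version in plain Python (a conceptual sketch, much simpler than Spark SQL's real inference, which also handles nested fields and proper type widening):

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping by scanning JSON records,
    a miniature of what Spark SQL does when loading JSON without a DDL."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            prev = schema.get(field)
            # keep the type if consistent; fall back to string on conflict
            schema[field] = t if prev in (None, t) else "str"
    return schema

lines = ['{"name": "ada", "age": 36}',
         '{"name": "alan", "age": 41, "uk": true}']
schema = infer_schema(lines)
```

Note that fields missing from some records (like "uk" above) still enter the schema; Spark SQL likewise unions the fields it sees and marks them nullable.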

3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
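The idea behind Parquet's columnar layout fits in a few lines: pivot rows into column vectors, so a query touching one column scans one contiguous vector instead of every row (a conceptual sketch, ignoring Parquet's encoding, compression, and on-disk format):

```python
def to_columns(rows):
    """Pivot row-oriented records into column vectors: the core idea of a
    columnar format like Parquet (conceptual sketch only)."""
    cols = {}
    for row in rows:
        for name, value in row.items():
            cols.setdefault(name, []).append(value)
    return cols

rows = [{"id": 1, "city": "LA"},
        {"id": 2, "city": "SF"},
        {"id": 3, "city": "LA"}]
cols = to_columns(rows)

# A query over one column touches only that column's vector:
la_count = cols["city"].count("LA")
```

Because each vector holds values of one type, columnar formats also compress far better than row storage, which is a large part of Parquet's appeal for analytics.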

3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark

  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

4 Complementarity +

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
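Why caching pays off when data fits in memory can be sketched in plain Python, with a list standing in for an RDD and explicit parse counting standing in for the cost Spark's `cache()` avoids on repeated passes:

```python
# Toy sketch of the Data << RAM case: parse once, keep the parsed
# result in memory, and every further pass skips the expensive parse
# (conceptually what RDD.cache() buys you in Spark).
raw_lines = ["1,2", "3,4", "5,6"]

parse_count = 0
def parse(line):
    global parse_count
    parse_count += 1
    a, b = line.split(",")
    return int(a), int(b)

# Without caching: two passes re-parse every line.
uncached_total = (sum(a for a, b in map(parse, raw_lines)) +
                  sum(b for a, b in map(parse, raw_lines)))
parses_without_cache = parse_count

# With caching: parse once, reuse the in-memory result.
parse_count = 0
cached = [parse(line) for line in raw_lines]   # the "cache()" step
cached_total = sum(a for a, b in cached) + sum(b for a, b in cached)
parses_with_cache = parse_count

print(parses_without_cache, parses_with_cache)  # 6 3
```

The totals are identical either way; only the amount of repeated work changes, which is exactly the trade-off the Data >> RAM / Data << RAM bullets above describe.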

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
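The common thread of the options above is that Spark resolves an input path by its URI scheme, so swapping storage backends is largely a matter of changing the path prefix. Below is a hypothetical plain-Python resolver sketching that dispatch; the scheme-to-backend table is illustrative, not Spark's actual file-system registry:

```python
from urllib.parse import urlparse

# Sketch of the "bring your own storage" idea: the URI scheme picks
# the backend. The mapping below is a hypothetical illustration.
def storage_backend(path):
    scheme = urlparse(path).scheme or "file"
    backends = {
        "hdfs": "Hadoop Distributed File System",
        "s3n": "Amazon S3",
        "tachyon": "Tachyon in-memory file system",
        "swift": "OpenStack Swift object store",
        "file": "local file system",
    }
    return backends.get(scheme, "unknown backend")

print(storage_backend("hdfs://namenode:8020/data/events"))
print(storage_backend("s3n://bucket/logs"))
print(storage_backend("/tmp/local.txt"))
```

The same `sc.textFile(path)` call works against any of these, which is why the deck can claim Spark is file-system agnostic.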

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform, Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
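What "ingesting data from sources that provide schema" means can be sketched in plain Python: infer column names and types from JSON records, then run a SQL-like selection over them. This only illustrates the concept; Spark SQL's actual JSON ingestion and query machinery does far more:

```python
import json

# Sketch of schema inference over JSON records, the idea behind
# Spark SQL's JSON data source (illustration only).
records = [json.loads(s) for s in (
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28}',
)]

def infer_schema(records):
    """Map each field name to the Python type name of its values."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema[field] = type(value).__name__
    return schema

schema = infer_schema(records)
print(schema)  # {'name': 'str', 'age': 'int'}

# The equivalent of: SELECT name FROM records WHERE age > 30
result = [r["name"] for r in records if r["age"] > 30]
print(result)  # ['alice']
```

Because the schema is discovered from the data itself, the same query can be mixed freely with imperative code over the same records, which is the "unifies SQL and sophisticated analysis" point above.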

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini batches
Latency                       Sub-second                 Few seconds
Fault tolerance (every        At least once (may         Exactly once
record processed)             be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
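The processing-model row of the table, record-at-a-time versus mini batches, can be sketched in plain Python; the doubling operation stands in for arbitrary per-record work:

```python
# Sketch of the two streaming processing models (illustration only):
# Storm hands the operator one record at a time, while Spark Streaming
# groups the stream into mini batches and processes each batch at once.
stream = list(range(10))

def record_at_a_time(stream):
    # One operator invocation per record: lowest latency.
    return [record * 2 for record in stream]

def mini_batches(stream, batch_size):
    # Records grouped into batches: latency of a few seconds,
    # but each batch can reuse Spark's batch machinery.
    batches = [stream[i:i + batch_size]
               for i in range(0, len(stream), batch_size)]
    return [[record * 2 for record in batch] for batch in batches]

print(record_at_a_time(stream))   # 10 individual results
print(mini_batches(stream, 4))    # 3 batches of sizes 4, 4, 2
```

Batching is also what gives Spark Streaming its exactly-once semantics and Core Spark API integration from the table: a lost batch can simply be recomputed as a unit.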

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

IV More QampA

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 13: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4. Analysts

• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."

• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."

• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: "Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop", posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop

13

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.

2. Surveys: Listen to what Spark developers are saying.

3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4. Analysts: Thorough understanding of the market dynamics.

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1. Big Data

• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3. Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark

• Apache Spark as an example of a typical Big Data stack.

• Apache Spark provides you Big Data computing, and more:

  • BYOS: Bring Your Own Storage

  • BYOC: Bring Your Own Cluster

• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!

5. Key Takeaways

1. Big Data: Still one of the most inflated buzzwords.

2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4. Apache Spark: Emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

• 1st generation: Batch

• 2nd generation: Batch, Interactive

• 3rd generation: Batch, Interactive, Near-Real Time

• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of Programming APIs). User Defined Functions (UDFs), …

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.

24

1 Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

  • Batch and Streaming in the same system

  • Beyond DAGs (cyclic operator graphs)

  • Powerful expressive APIs

  • Inside-the-system iterations

  • Full Hadoop compatibility

  • Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                  Tez                        Spark
License              Open Source Apache 2.0,    Open Source Apache 2.0,    Open Source Apache 2.0,
                     version 2.x                version 0.x                version 1.x
Processing model     On-disk (disk-based        On-disk; Batch,            In-memory and on-disk;
                     parallelization); Batch    Interactive                Batch, Interactive,
                                                                           Streaming (near real-time)
Language written in  Java                       Java                       Scala
API                  [Java, Python, Scala],     Java, [ISV/Engine/Tool     [Scala, Java, Python],
                     user-facing                builder]                   user-facing
Libraries            None, separate tools       None                       [Spark Core, Spark
                                                                           Streaming, Spark SQL,
                                                                           MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                  Tez                        Spark
Installation         Bound to Hadoop            Bound to Hadoop            Isn't bound to Hadoop
Ease of use          Difficult to program,      Difficult to program;      Easy to program, no need
                     needs abstractions; no     no interactive mode        of abstractions;
                     interactive mode (except   (except Hive, Pig)         interactive mode
                     Hive, Pig)
Compatibility        Same data types and        Same data types and        Same data types and
                     data sources               data sources               data sources
YARN integration     YARN application           Ground-up YARN             Spark is moving
                                                application                towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                  Tez                        Spark
Deployment           YARN                       YARN                       [Standalone, YARN,
                                                                           SIMR, Mesos, …]
Performance          -                          -                          Good performance when data
                                                                           fits into memory;
                                                                           performance degradation
                                                                           otherwise
Security             More features and          More features and          Still in its infancy
                     projects                   projects                   (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
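Point 1 above, reusing your existing mapper and reducer functions, can be sketched in plain Python, with lists standing in for RDDs; a real migration would call the same functions from Spark's flatMap and reduceByKey:

```python
from collections import defaultdict
from functools import reduce

# Sketch of the migration idea: the mapper and reducer written for
# Hadoop MapReduce are reused unchanged inside a Spark-style pipeline.
# Plain Python lists stand in for RDDs to keep the sketch runnable.
def mapper(line):                 # MapReduce map(): emit (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reducer(a, b):                # MapReduce reduce(): sum the counts
    return a + b

lines = ["spark and hadoop", "spark with hadoop"]

pairs = [pair for line in lines for pair in mapper(line)]   # flatMap(mapper)
grouped = defaultdict(list)
for word, count in pairs:                                   # the shuffle
    grouped[word].append(count)
counts = {word: reduce(reducer, values)                     # reduceByKey(reducer)
          for word, values in grouped.items()}

print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'with': 1}
```

Because the mapper and reducer are ordinary functions, nothing about them is MapReduce-specific; only the driver code around them changes in the migration.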

32

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3. Integration

Hadoop ecosystem services with Spark integration (open source tools):

• Storage/Serving layer

• Data formats

• Data ingestion services

• Resource management

• Search

• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45
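The newAPIHadoopRDD route mentioned above can be sketched along the lines of Spark's own HBaseTest.scala example. This assumes an HBase 0.98-era client on the classpath; the table name is a placeholder:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))

// Point the Hadoop InputFormat at an existing HBase table.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// newAPIHadoopRDD turns any Hadoop InputFormat into an RDD of key/value pairs:
// here, (row key, full row result).
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(hBaseRDD.count())
sc.stop()
```

No HBase-specific Spark code is needed: the integration rides entirely on the Hadoop InputFormat machinery.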

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46
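The read and write paths of the Spark Cassandra Connector described above can be sketched as follows. The keyspace, table and column names are placeholders, and the connector JAR is assumed to be on the classpath:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as a Spark RDD...
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// ...and write an ordinary RDD back to another Cassandra table.
sc.parallelize(Seq(("alice", 42), ("bob", 7)))
  .saveToCassandra("my_keyspace", "scores", SomeColumns("name", "score"))

sc.stop()
```

The `import com.datastax.spark.connector._` line is what adds `cassandraTable` and `saveToCassandra` to the standard Spark classes.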

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira (open SPARK issues with 'yarn' in the summary, ordered by priority).

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

  • Import relational data from Hive tables

  • Run SQL queries over imported data

  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
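The Hive support above can be sketched with the Spark 1.2-era HiveContext. The table and partition names are placeholders; a Hive metastore is assumed to be configured:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
val hiveContext = new HiveContext(sc)

// Query an existing Hive table through the Hive metastore.
val recent = hiveContext.sql(
  "SELECT user, action FROM events WHERE dt = '2015-03-12'")

// The result is an ordinary (Schema)RDD of Rows, so it can feed MLlib
// or be registered and written back as another Hive table.
recent.registerTempTable("recent_events")
hiveContext.sql(
  "CREATE TABLE event_summary AS " +
  "SELECT action, COUNT(*) AS n FROM recent_events GROUP BY action")

sc.stop()
```

The same HiveContext also picks up existing Hive UDFs and data formats, which is what the compatibility bullet above refers to.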

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
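The native Kafka integration above can be sketched with the receiver-based API of the Spark 1.2 era. The ZooKeeper address, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

// (ZooKeeper quorum, consumer group, topic -> number of receiver threads)
val lines = KafkaUtils
  .createStream(ssc, "zk1:2181", "demo-group", Map("events" -> 1))
  .map(_._2) // drop the Kafka message key, keep the payload

// Classic streaming word count over each 10-second batch.
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```

Each batch interval produces one RDD of messages, so the rest of the pipeline is ordinary Spark code.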

3 Integration

• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

  • Approach 1: Flume-style push-based approach

  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55
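Approach 1 above (push-based) can be sketched as follows, modeled on Spark's FlumeEventCount example. The receiver host and port are placeholders; a Flume agent is assumed to be configured with an Avro sink pointing at them:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumeEventCount")
val ssc = new StreamingContext(conf, Seconds(5))

// Push-based approach: Spark Streaming acts as an Avro endpoint
// that the Flume agent pushes events to.
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 4141)

flumeStream.count().map(c => s"Received $c Flume events").print()

ssc.start()
ssc.awaitTermination()
```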

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56
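The schema-inference workflow above can be sketched in a few lines of Spark 1.2-era Scala; the file path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
val sqlContext = new SQLContext(sc)

// The schema is inferred automatically from the JSON records – no DDL.
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the SchemaRDD and query it with plain SQL.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21")
  .collect().foreach(println)

sc.stop()
```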

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

  • Import relational data from Parquet files

  • Run SQL queries over imported data

  • Easily write RDDs out to Parquet files

  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
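The Parquet round trip described above can be sketched with the Spark 1.2-era API; all paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
val sqlContext = new SQLContext(sc)

// Save a SchemaRDD as Parquet (the schema travels with the file)...
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.saveAsParquetFile("hdfs:///data/people.parquet")

// ...then read it back and query it with SQL.
val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("people")
sqlContext.sql("SELECT COUNT(*) FROM people").collect().foreach(println)

sc.stop()
```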

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

  • Problem:

    • Various inbound data sets

    • Data layout can change without notice

    • New data sets can be added without notice

  • Result:

    • Leverage Spark to dynamically split the data

    • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
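The elasticsearch-hadoop RDD integration above can be sketched as follows. The cluster address, index and document contents are placeholders, and the elasticsearch-spark JAR is assumed to be on the classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs() to RDDs and esRDD() to sc

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Any RDD whose elements translate into documents can be saved.
val docs = sc.parallelize(Seq(
  Map("airport" -> "OTP", "arrival" -> "2015-03-12"),
  Map("airport" -> "SFO", "arrival" -> "2015-03-13")))
docs.saveToEs("spark/docs") // "index/type"

// Reading back: every matching document becomes an RDD element.
val fromEs = sc.esRDD("spark/docs")
println(fromEs.count())

sc.stop()
```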

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:

  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark

  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

(Diagram: Hadoop ecosystem and Spark ecosystem side by side)

4. Complementarity: Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. Use OpenStack Swift (Object Store):

  • https://spark.apache.org/docs/latest/storage-openstack-swift.html

  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:

  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

             Hadoop Ecosystem    Spark Ecosystem

Component:   HDFS                Tachyon
             YARN                Mesos

Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as the Data Center "OS":

  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                  Mesos

Resource sharing   Yes                   Yes
Written in         Java                  C++
Scheduling         Memory only           CPU and Memory
Running tasks      Unix processes        Linux Container groups
Requests           Specific requests     More generic, but more coding
                   and locality          for writing frameworks
                   preference
Maturity           Less mature           Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
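The conciseness of the native API above is easiest to see in the classic word count, sketched here in Scala; the input and output paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

// Split lines into words, pair each word with 1, then sum per word.
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/counts")
sc.stop()
```

The equivalent Python version, and the Java 8 lambda version, read almost line for line the same, which is the point the slide makes about the API.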

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria            Storm                Spark Streaming

Processing model    Record at a time     Mini batches
Latency             Sub-second           Few seconds
Fault tolerance     At least once (may   Exactly once
(every record       be duplicates)
processed)
Batch framework     Not available        Core Spark API
integration
Supported           Any programming      Scala, Java, Python
languages           language

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring Your Own Storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 14: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

5. Key Takeaways

1. News: Big Data is no longer a Hadoop monopoly.

2. Surveys: listen to what Spark developers are saying.

3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.

4. Analysts: thorough understanding of the market dynamics.

14

II Big Data Typical Big Data

Stack Hadoop Spark

1 Big Data

2 Typical Big Data Stack

3 Apache Hadoop

4 Apache Spark

5 Key Takeaways

15

1. Big Data

• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool; the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3. Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4. Apache Spark

• Apache Spark as an example of a typical Big Data stack.

• Apache Spark provides you Big Data computing, and more:

  • BYOS: Bring Your Own Storage

  • BYOC: Bring Your Own Cluster

  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots from BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!

19

5. Key Takeaways

1. Big Data: still one of the most inflated buzzwords.

2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4. Apache Spark: emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1. Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding: a Scala API for Cascading. http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

• 1st generation: batch

• 2nd generation: batch, interactive

• 3rd generation: batch, interactive, near-real time

• 4th generation: batch, interactive, real-time, iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of Programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1 Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

  • Batch and streaming in the same system

  • Beyond DAGs (cyclic operator graphs)

  • Powerful, expressive APIs

  • Inside-the-system iterations

  • Full Hadoop compatibility

  • Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria        MapReduce               Tez                    Spark

License         Open Source Apache      Open Source Apache     Open Source Apache
                2.0, version 2.x        2.0, version 0.x       2.0, version 1.x
Processing      On-disk (disk-based     On-disk; batch,        In-memory and on-disk;
model           parallelization);       interactive            batch, interactive,
                batch                                          streaming (near real-time)
Language        Java                    Java                   Scala
written in
API             [Java, Python,          Java [ISV/Engine/      [Scala, Java, Python],
                Scala], user-facing     Tool builder]          user-facing
Libraries       None, separate tools    None                   [Spark Core, Spark
                                                               Streaming, Spark SQL,
                                                               MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                   | Tez                                                   | Spark
Installation     | Bound to Hadoop                                                    | Bound to Hadoop                                       | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is same                             | to data types and data sources is same                | to data types and data sources is same
YARN integration | YARN application                                                   | Ground up YARN application                            | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
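The point about reusing mapper and reducer functions can be sketched without a cluster. Below is a plain-Python stand-in for the Spark word-count pipeline: the `mapper` and `reducer` names are hypothetical MapReduce-era functions, and `flat_map` / `reduce_by_key` are local substitutes for the real RDD operations (in Spark itself this would be `sc.textFile(path).flatMap(mapper).reduceByKey(reducer)`).

```python
from functools import reduce
from itertools import groupby

# Mapper and reducer carried over from a (hypothetical) MapReduce job.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

# Plain-Python stand-ins for Spark's flatMap / reduceByKey,
# so the same dataflow runs locally without Spark.
def flat_map(f, data):
    return [item for x in data for item in f(x)]

def reduce_by_key(f, pairs):
    pairs = sorted(pairs, key=lambda kv: kv[0])
    return {k: reduce(f, (v for _, v in group))
            for k, group in groupby(pairs, key=lambda kv: kv[0])}

lines = ["spark and hadoop", "spark or hadoop"]
counts = reduce_by_key(reducer, flat_map(mapper, lines))
# counts -> {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

The shape of the code is the point: the existing map and reduce logic survives unchanged; only the job-driver boilerplate disappears.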

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015) https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (Expected in 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Service / Open Source Tool categories (diagram):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope/
• Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

48

3 Integration

• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original Resource Negotiator)
• Integration still improving; see the open YARN-related Spark issues in Jira https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

52
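The workflow the bullets describe (import relational data, run SQL over it, write results back) can be sketched locally with Python's sqlite3 as a stand-in engine. No Spark or Hive here; the table name and rows are invented for illustration. In Spark itself this would go through a HiveContext against real Hive tables.

```python
import sqlite3

# sqlite3 stands in for the SQL engine; 'logs' and its rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")],
)

# Run a SQL query over the imported data, exactly the pattern
# Spark SQL offers on top of Hive tables.
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
# rows -> [('ERROR', 2), ('INFO', 1)]
```

The value of the Spark SQL integration is that this same SQL surface sits directly on cluster-resident Hive data, with query results available as RDDs for further processing.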

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline

Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
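Spark Streaming's model is worth spelling out: a stream is cut into small micro-batches, and each batch is processed like an ordinary RDD. The sketch below shows that idea in plain Python, with a list standing in for a Kafka topic and a per-batch word count as the computation (the record contents and batch size are invented).

```python
from collections import Counter

# A list stands in for records consumed from a hypothetical Kafka topic.
stream = ["a b", "b c", "c d", "a d"]
batch_size = 2  # Spark Streaming batches by time interval; we batch by count.

def micro_batches(records, size):
    """Cut the stream into micro-batches, as Spark Streaming's DStream does."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Each micro-batch is processed as an ordinary collection (RDD-like).
per_batch_counts = [
    Counter(word for rec in batch for word in rec.split())
    for batch in micro_batches(stream, batch_size)
]
```

This micro-batch design is exactly what distinguishes Spark Streaming from record-at-a-time systems like Storm (see the Storm vs Spark Streaming comparison later in the deck).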

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
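Schema inference over JSON is a simple idea: scan the records and collect the fields and value types that actually occur. A toy version in plain Python (the records are invented; Spark SQL's real inference also merges nested structures and resolves type conflicts):

```python
import json

# Invented JSON records; in Spark SQL these would come from files on HDFS/S3.
records = [
    '{"name": "ada", "age": 36}',
    '{"name": "alan", "age": 41, "city": "London"}',
]

# Collect every field name and the Python type names observed for it.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)
# schema -> {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```

Note how the second record contributes a field the first one lacks; this union-of-fields behavior is why no up-front DDL is needed.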

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/

57
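Why "columnar" matters deserves one line of code: when values are laid out per column instead of per row, a query that touches one field reads only that field's values. A toy illustration in plain Python (the records are invented; this is the layout idea only, not the Parquet format itself, which adds encoding, compression, and row groups):

```python
# Row-oriented layout: one dict per record (data invented for illustration).
rows = [
    {"user": "u1", "bytes": 100, "country": "US"},
    {"user": "u2", "bytes": 250, "country": "FR"},
    {"user": "u3", "bytes": 50,  "country": "US"},
]

# Column-oriented layout: one list per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# An aggregate over a single field scans one list and skips the rest --
# the access pattern that makes Parquet fast for analytical queries.
total_bytes = sum(columns["bytes"])
```

Columnar layout also compresses better, since values of the same type and domain sit next to each other on disk.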

3 Integration

• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem / Spark ecosystem (diagram)

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:
• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

68

4 Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", with a more mature shuffling implementation and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS             | Tachyon
YARN             | Mesos
Tools:
Pig              | Spark native API
Hive             | Spark SQL
Mahout           | MLlib
Storm            | Spark Streaming
Giraph           | GraphX
HUE              | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria         | YARN                                     | Mesos
Resource sharing | Yes                                      | Yes
Written in       | Java                                     | C++
Scheduling       | Memory only                              | CPU and Memory
Running tasks    | Unix processes                           | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                              | Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

| Criteria | Storm | Spark Streaming |
|---|---|---|
| Processing model | Record at a time | Mini-batches |
| Latency | Sub-second | Few seconds |
| Fault tolerance - every record processed | At least once (may be duplicates) | Exactly once |
| Batch framework integration | Not available | Core Spark API |
| Supported languages | Any programming language | Scala, Java, Python |

95

GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
99

V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


II. Big Data, Typical Big Data Stack, Hadoop, Spark

1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways

15

1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that have outpaced our capability to store, process, analyze, and understand." - Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16

2. Typical Big Data Stack
17

3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
18

4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage!
  • BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19

5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
20

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

21

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
23

• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real time
• 4th Generation: Batch, Interactive, Real-Time, Iterative

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
24

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
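A minimal sketch (not from the deck) of the in-memory RDD capability described above; the input path and parsing are hypothetical. Caching an RDD lets iterative and interactive workloads reuse it without re-reading from disk:

```scala
// Parse once, keep the result in cluster memory, reuse it across actions.
val events = sc.textFile("hdfs:///logs/events.log")
  .map(line => line.split("\t"))
  .cache()                                   // mark the RDD for in-memory storage

val total  = events.count()                  // first action materializes the cache
val errors = events.filter(fields => fields(0) == "ERROR").count()  // served from memory
```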

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x |
| Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time) |
| Language written in | Java | Java | Scala |
| API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing |
| Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX] |

28

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop |
| Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode |
| Compatibility | Same with respect to data types and data sources | Same with respect to data types and data sources | Same with respect to data types and data sources |
| YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN |

29

Hadoop MapReduce vs. Tez vs. Spark

| Criteria | MapReduce | Tez | Spark |
|---|---|---|---|
| Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …] |
| Performance | - | - | Good performance when data fits into memory; performance degradation otherwise |
| Security | More features and projects | More features and projects | Still in its infancy (partial support) |

30

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

31

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
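As a sketch of points 1 and 2 (an assumed example, not taken from the Cloudera post): the classic MapReduce word count collapses into a few Spark transformations, with the mapper and reducer logic reused as plain functions:

```scala
// Mapper logic: line -> (word, 1) pairs. Reducer logic: sum the counts.
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1))

val counts = sc.textFile("hdfs:///input")
  .flatMap(mapper)        // replaces the MapReduce map phase
  .reduceByKey(_ + _)     // replaces the shuffle + reduce phase
counts.saveAsTextFile("hdfs:///output")
```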

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39

Apache Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40

Apache Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration
Services and corresponding open source tools (the tool logos from the original slide are omitted in this text version):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
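A condensed sketch in the spirit of HBaseTest.scala (assuming HBase client jars on the classpath and an existing table named "t1"):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the Hadoop InputFormat at an HBase table, then read it as an RDD.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "t1")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hBaseRDD.count())
```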

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
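A minimal sketch of the connector's documented read/write API (assuming the spark-cassandra-connector on the classpath, and a hypothetical keyspace "test" with a table "words(word text, count int)"):

```scala
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD...
val words = sc.cassandraTable("test", "words")
println(words.count())

// ...and write an RDD back to Cassandra.
sc.parallelize(Seq(("spark", 60), ("hadoop", 40)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```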

3. Integration
• A benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48

3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN (JIRA query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC).
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0; see SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
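A sketch of the three bullets above, using the Spark 1.2-era HiveContext (assuming a configured Hive metastore and a hypothetical table "src"):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Run HiveQL directly against existing Hive tables.
val rows = hiveContext.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)
// A query result can also be registered and reused like any other table.
rows.registerTempTable("sample")
```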

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka; see the Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
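A sketch of the native integration, using the Spark 1.2-era receiver-based API (the ZooKeeper address, consumer group and topic name are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))      // 2-second micro-batches
val lines = KafkaUtils
  .createStream(ssc, "zkhost:2181", "my-consumer-group", Map("events" -> 1))
  .map(_._2)                                        // keep the message payload
lines.count().print()                               // messages per batch
ssc.start()
ssc.awaitTermination()
```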

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: a Flume-style push-based approach
  • Approach 2 (experimental): a pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55

3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
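A minimal sketch of the schema inference described above, with the Spark 1.2-era API (people.json is a hypothetical file with one JSON object per line):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")  // schema inferred automatically
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```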

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
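A sketch of the Parquet round trip with the Spark 1.2-era API (file paths are hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Any SchemaRDD can be written out as Parquet, schema included...
val people = sqlContext.jsonFile("people.json")
people.saveAsParquetFile("people.parquet")
// ...and read back with the schema preserved.
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect().foreach(println)
```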

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
58

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
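A sketch of elasticsearch-hadoop's RDD-to-documents direction (assuming the elasticsearch-spark jar on the classpath and a reachable Elasticsearch node; the index/type name is hypothetical):

```scala
import org.elasticsearch.spark._   // adds saveToEs() to RDDs

val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark or Hadoop?", "year" -> 2015),
  Map("title" -> "Spark with Hadoop", "year" -> 2015)))
docs.saveToEs("talks/slides")      // "index/type" target
```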

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop", by Enrico Berti at Big Data Spain 2014: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
(Diagram: Hadoop ecosystem and Spark ecosystem, side by side)

4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65

4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66

4. Complementarity: Mesos + YARN
References:
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad, a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69

4. Complementarity
• Emergence of a 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
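A sketch of point 4 for Amazon S3, with the Spark 1.x-era s3n scheme (bucket, key and credentials are hypothetical placeholders):

```scala
// No HDFS involved: read straight from S3 through the Hadoop FileSystem API.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val logs = sc.textFile("s3n://my-bucket/logs/2015-03-12.log")
println(logs.count())
```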

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• The Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

             Hadoop ecosystem   Spark ecosystem
Components   HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and memory
Running tasks     Unix processes                Linux container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
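The "mix and match SQL with imperative code" pattern can be sketched without a Spark cluster; here stdlib sqlite3 stands in for Spark SQL (table name and rows are hypothetical), with a declarative query followed by an imperative transformation of its result:

```python
import sqlite3

# Stand-in for a Spark SQL context: an in-memory SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")])

# Declarative step: a SQL query...
rows = conn.execute(
    "SELECT msg FROM logs WHERE level = 'ERROR' ORDER BY rowid").fetchall()

# ...followed by an imperative step over the result (in Spark SQL this would
# be RDD/DataFrame transformations applied to the query output).
shouting = [msg.upper() for (msg,) in rows]
print(shouting)  # ['DISK FULL', 'TIMEOUT']
```

In real Spark SQL the query output is an RDD/SchemaRDD, so the second step runs distributed; the programming pattern is the same.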

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once              Exactly once
record processed)            (may be duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
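The processing-model row is the key difference in the table above. A toy contrast in plain Python (no Storm or Spark required; in real Spark Streaming the batch boundary is a time interval, not a count):

```python
def record_at_a_time(stream, handle):
    # Storm-style: each record is handed off as soon as it arrives.
    for record in stream:
        handle([record])

def mini_batches(stream, handle, batch_size=3):
    # Spark Streaming-style: records are buffered into small batches,
    # then each batch is processed with the regular batch machinery.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:          # flush the final partial batch
        handle(batch)

calls = []
mini_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what gives Spark Streaming its "few seconds" latency and its exactly-once semantics per batch, while Storm's per-record hand-off gets sub-second latency.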

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring Your Own Storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


1 Big Data

• Big Data is still one of the most inflated buzzwords of the last years.

• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data

• Hadoop is becoming a traditional tool, so the above definition is inadequate.

• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)

16

2 Typical Big Data Stack

17

3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack.

• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).

• Big Data Ecosystem Dataset (http://bigdata.andreamostosi.name): an incomplete but useful list of Big Data related projects, packed into a JSON dataset.

• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack.

• Apache Spark provides you Big Data computing, and more:

• BYOS: Bring Your Own Storage

• BYOC: Bring Your Own Cluster

• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!

19

5 Key Takeaways

1 Big Data: Still one of the most inflated buzzwords.

2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.

4 Apache Spark: Emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

• Pig: http://pig.apache.org

• Hive: http://hive.apache.org

• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

• Cascading: http://www.cascading.org

• Scalding, a Scala API for Cascading: http://twitter.com/scalding

• Crunch: http://crunch.apache.org

• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

• 1st generation: Batch

• 2nd generation: Batch, Interactive

• 3rd generation: Batch, Interactive, Near-Real-Time

• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:

• Batch and streaming in the same system

• Beyond DAGs (cyclic operator graphs)

• Powerful, expressive APIs

• Inside-the-system iterations

• Full Hadoop compatibility

• Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark

License:
• MapReduce: Open Source, Apache 2.0, version 2.x
• Tez: Open Source, Apache 2.0, version 0.x
• Spark: Open Source, Apache 2.0, version 1.x

Processing model:
• MapReduce: On-disk (disk-based parallelization); Batch
• Tez: On-disk; Batch, Interactive
• Spark: In-memory and on-disk; Batch, Interactive, Streaming (near real-time)

Language written in:
• MapReduce: Java
• Tez: Java
• Spark: Scala

API:
• MapReduce: [Java, Python, Scala], user-facing
• Tez: Java [ISV/engine/tool builder]
• Spark: [Scala, Java, Python], user-facing

Libraries:
• MapReduce: None, separate tools
• Tez: None
• Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Installation:
• MapReduce: Bound to Hadoop
• Tez: Bound to Hadoop
• Spark: Isn't bound to Hadoop

Ease of use:
• MapReduce: Difficult to program, needs abstractions; no interactive mode except Hive, Pig
• Tez: Difficult to program; no interactive mode except Hive, Pig
• Spark: Easy to program, no need of abstractions; interactive mode

Compatibility:
• Compatibility to data types and data sources is the same for all three

YARN integration:
• MapReduce: YARN application
• Tez: Ground-up YARN application
• Spark: Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Deployment:
• MapReduce: YARN
• Tez: YARN
• Spark: [Standalone, YARN, SIMR, Mesos, …]

Performance:
• Spark: Good performance when data fits into memory; performance degradation otherwise

Security:
• MapReduce: More features and projects
• Tez: More features and projects
• Spark: Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
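The "reuse your mapper and reducer" point can be sketched in plain Python (no Hadoop or Spark required): the mapper/reducer functions stay as they are, and only the driver around them changes. The word-count mapper and reducer below are hypothetical stand-ins for existing MapReduce code:

```python
from itertools import chain

# Existing Hadoop-style functions, unchanged:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return key, sum(values)

# New Spark-like driver: flatMap(mapper) -> groupByKey -> map(reducer).
# In real Spark this chain runs distributed over an RDD.
def run(lines):
    pairs = list(chain.from_iterable(mapper(line) for line in lines))
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

print(run(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

This mirrors the Cloudera how-to linked above: the business logic ports directly, while the job plumbing shrinks to a few chained transformations.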

32

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Services with open source tool integrations (the tools themselves were shown as logos on the slide): Storage/Serving layer, Data formats, Data ingestion services, Resource management, Search, SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables

• Run SQL queries over imported data

• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
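The automatic schema inference mentioned above can be sketched in plain Python with the stdlib json module (no Spark required; field names and records are hypothetical, and real Spark SQL infers richer SQL types and nested structures):

```python
import json

# Toy version of JSON schema inference: take the union of all fields seen
# across records, recording the type of the first value for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, type(value).__name__)
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

This is the "no more DDL" point: the schema comes from the data itself, so records with differing field sets can still be queried over their merged schema.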

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files

• Run SQL queries over imported data

• Easily write RDDs out to Parquet files. http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

• Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark

• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

bull A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive

bull Demo of Spark Igniter: http://vimeo.com/83192197

bull Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services

bull Project Myriad is an open source framework for running YARN on Mesos

bull 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

bull Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

bull Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

bull YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)

bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling

bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration

bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory

bull Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

bull The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

bull Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity

3 Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
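As a minimal illustration of this storage-agnosticism, the same `textFile` call works against any URI scheme the underlying Hadoop client libraries support. The paths below are hypothetical, and S3/Tachyon access assumes the relevant credentials and jars are configured; `sc` is an existing SparkContext.

```scala
// One API, many storage backends: only the URI scheme changes
val fromLocal   = sc.textFile("file:///tmp/input.txt")
val fromHdfs    = sc.textFile("hdfs://namenode:8020/data/input")
val fromS3      = sc.textFile("s3n://my-bucket/logs/")             // needs AWS keys configured
val fromTachyon = sc.textFile("tachyon://master:19998/data/input")
println(fromLocal.count())
```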

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS: https://www.quantcast.com/engineering/qfs

bull …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
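Because the cluster manager is only a configuration choice, the same application code runs unchanged across these deployments. A minimal Spark 1.x sketch (host names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only the master URL changes between deployment modes
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[4]")            // local mode with 4 threads
// .setMaster("spark://host:7077")  // standalone cluster
// .setMaster("mesos://host:5050")  // Apache Mesos
// .setMaster("yarn-client")        // Hadoop YARN (Spark 1.x syntax)
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 100).sum()) // same job, any cluster manager
```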

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

bull Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

bull Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

bull DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

bull Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

bull Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

bull 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

bull xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications

bull xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems

bull 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments

bull With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

bull 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

bull Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

bull The Guavus operational intelligence platform analyzes streaming data and data at rest

bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

bull 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem          Spark ecosystem

Component
HDFS                      Tachyon
YARN                      Mesos

Tools
Pig                       Spark native API
Hive                      Spark SQL
Mahout                    MLlib
Storm                     Spark Streaming
Giraph                    GraphX
HUE                       Spark Notebook / ISpark

87

bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change

bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
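In Spark 1.x, one visible touchpoint with Tachyon is the OFF_HEAP storage level, which keeps cached RDD blocks outside the executor JVMs in Tachyon. A hedged sketch (the Tachyon master address is hypothetical; assumes `spark.tachyonStore.url` points at a running Tachyon cluster and `sc` is an existing SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

// OFF_HEAP in Spark 1.x stores cached blocks in Tachyon, so cached data
// survives executor crashes and can be shared across Spark applications
val data = sc.textFile("tachyon://master:19998/data/input")
data.persist(StorageLevel.OFF_HEAP)
println(data.count())
```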

88

bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs

bull Mesos as data center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services

bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

bull 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and Memory
Running tasks     Unix processes              Linux Container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

90

Spark Native API

bull Spark native API in Scala, Java, and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

bull ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

bull 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore

bull Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
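The mixing of SQL and imperative APIs can be sketched with Spark 1.x's JSON support. The input file and field names are hypothetical; `sc` is an existing SparkContext.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Ingest schema-carrying JSON (one JSON object per line) as a table
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Declarative SQL step...
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// ...mixed with the imperative RDD API
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```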

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
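A minimal Spark Streaming sketch, to show the mini-batch model discussed in the comparison that follows. The socket source is hypothetical; `sc` is an existing SparkContext.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word counts over 10-second mini batches
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```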

Storm vs Spark Streaming

Criteria                     Storm                    Spark Streaming
Processing model             Record at a time         Mini batches
Latency                      Sub-second               Few seconds
Fault tolerance (every       At least once (may be    Exactly once
record processed)            duplicates)
Batch framework integration  Not available            Core Spark API
Supported languages          Any programming          Scala, Java, Python
                             language

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

bull Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

bull ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring Your Own Storage

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment

3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging

4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


2 Typical Big Data Stack

17

3 Apache Hadoop

bull Apache Hadoop as an example of a typical Big Data stack

bull Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)

bull Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset

bull Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark

bull Apache Spark as an example of a typical Big Data stack

bull Apache Spark provides you Big Data computing, and more:

bull BYOS: Bring Your Own Storage

bull BYOC: Bring Your Own Cluster

bull Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark

bull Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

bull Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql

bull MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib

bull GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx

bull The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned

19

5 Key Takeaways

1 Big Data: still one of the most inflated buzzwords

2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?

3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data

4 Apache Spark: emergence of the Apache Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

bull MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount

bull Pig: http://pig.apache.org

bull Hive: http://hive.apache.org

bull Scoobi: a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi

bull Cascading: http://www.cascading.org

bull Scalding: a Scala API for Cascading: http://twitter.com/scalding

bull Crunch: http://crunch.apache.org

bull Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink

23

bull 1st Generation: Batch

bull 2nd Generation: Batch, Interactive

bull 3rd Generation: Batch, Interactive, Near-Real time

bull 4th Generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

bull This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org

bull Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs), …

bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job

bull Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics

24

1 Evolution

bull Tez: Hindi for "speed"

bull This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

bull Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

25

1 Evolution

bull 'Spark' for lightning-fast speed

bull This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing": https://spark.apache.org

bull Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time

bull The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark

26

1 Evolution: Apache Flink

bull Flink: German for "nimble, swift, speedy"

bull This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"

bull Apache Flink (http://flink.apache.org) offers:

bull Batch and streaming in the same system

bull Beyond DAGs (cyclic operator graphs)

bull Powerful, expressive APIs

bull Inside-the-system iterations

bull Full Hadoop compatibility

bull Automatic, language-independent optimizer

bull 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                      Spark
License           Open Source Apache 2.0,   Open Source Apache 2.0,  Open Source Apache 2.0,
                  version 2.x               version 0.x              version 1.x
Processing model  On-disk (disk-based       On-disk; Batch,          In-memory and on-disk;
                  parallelization); Batch   Interactive              Batch, Interactive,
                                                                     Streaming (near real-time)
Language written  Java                      Java                     Scala
in
API               [Java, Python, Scala],    Java [ISV/Engine/Tool    [Scala, Java, Python],
                  user-facing               builder]                 user-facing
Libraries         None, separate tools      None                     [Spark Core, Spark Streaming,
                                                                     Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                      Spark
Installation      Bound to Hadoop           Bound to Hadoop          Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;    Easy to program, no need
                  needs abstractions; no    no interactive mode      of abstractions;
                  interactive mode except   except Hive, Pig         interactive mode
                  Hive, Pig
Compatibility     Same for data types and   Same for data types      Same for data types
                  data sources              and data sources         and data sources
YARN integration  YARN application          Ground-up YARN           Spark is moving
                                            application              towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                    Tez                          Spark
Deployment   YARN                         YARN                         [Standalone, YARN,
                                                                       SIMR, Mesos, …]
Performance  -                            -                            Good performance when data
                                                                       fits into memory; performance
                                                                       degradation otherwise
Security     More features and projects   More features and projects   Still in its infancy
                                                                       (partial support)

30

IV Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

bull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
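The shape of such a translation can be sketched with the classic word count: the mapper becomes flatMap + map, and the reducer becomes reduceByKey. The paths are hypothetical; `sc` is an existing SparkContext.

```scala
// MapReduce word count translated to Spark (sketch)
val counts = sc.textFile("hdfs:///data/input")
  .flatMap(line => line.split("\\s+")) // mapper: emit one token per word
  .map(word => (word, 1))              // mapper: emit (word, 1) pairs
  .reduceByKey(_ + _)                  // reducer: sum the counts per word
counts.saveAsTextFile("hdfs:///data/output")
```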

32

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

bull Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

bull Run Pig with the "-x spark" option for an easy migration without development effort

bull Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)

bull Leverage new Spark-specific operators in Pig, such as Cache

bull Still leverage many existing Pig UDF libraries

bull Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

bull Fix outstanding issues and address additional Spark functionality through the community

bull 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

bull New alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

bull Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort

bull Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop

bull Performance benefits, especially for Hive queries involving multiple reducer stages

bull Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

bull Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

bull Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

bull Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

bull Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

bull 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

bull Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop

bull The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources

bull The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

bull Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

bull Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop

bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

bull spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

bull The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

bull Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

bull Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

bull Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

bull Integration of Mahout and Spark:

bull Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark

bull Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)

bull Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

bull Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

bull Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Hadoop ecosystem services, each with open source tools that integrate with Spark:

bull Storage/Serving layer
bull Data formats
bull Data ingestion services
bull Resource management
bull Search
bull SQL

3 Integration

bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3

bull Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

bull Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

bull Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

bull SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
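The newAPIHadoopRDD route can be sketched as follows, in the spirit of HBaseTest.scala. The table name is hypothetical; assumes the HBase client jars on the classpath, an hbase-site.xml on the configuration path, and an existing SparkContext `sc`.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Read an HBase table as an RDD via the Hadoop InputFormat API
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

val rows = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], // row key
  classOf[Result])                 // row contents

println(rows.count())
```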

45

3 Integration

bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

bull Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

bull 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
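The Spark Cassandra Connector usage can be sketched as follows. The keyspace, table, and column names are hypothetical; assumes the connector jar on the classpath, `spark.cassandra.connection.host` set in the Spark configuration, and an existing SparkContext `sc`.

```scala
import com.datastax.spark.connector._ // adds cassandraTable() / saveToCassandra()

// Expose a Cassandra table as a Spark RDD
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Write a Spark RDD back to a Cassandra table
val counts = sc.parallelize(Seq(("alice", 1), ("bob", 2)))
counts.saveToCassandra("my_keyspace", "counts", SomeColumns("name", "n"))
```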

46

3 Integration

bull Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

bull A Cassandra storage backend with Spark is opening many new avenues

bull Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

bull MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

bull MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

bull MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

bull Spark SQL also provides indirect support via its support for reading and writing JSON text files

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
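As a rough illustration of what "automatically infer the schema" means, the toy sketch below scans JSON records and collects the union of field names and value types, which is essentially the idea Spark SQL applies before exposing the result as a SchemaRDD. The records and field names here are made up for the example; this is plain Python, not the Spark SQL implementation.

```python
import json

# Two JSON records with overlapping but not identical fields (illustrative data)
records = ['{"name": "spark", "year": 2014}', '{"name": "hadoop", "stars": 5}']

# Infer a schema as the union of fields seen across all records,
# keeping the type of the first value observed for each field
schema = {}
for rec in records:
    for field, value in json.loads(rec).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'year': 'int', 'stars': 'int'}
```

Spark SQL performs a richer version of this (nested structures, type widening across records), but the "scan, union the fields, assign types, no DDL" flow is the same.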

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrates ingestion of HDFS data into Solr from MapReduce to Spark
  • Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity: Tachyon + Spark

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity: Mesos + YARN

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than the cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
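The Data << RAM point rests on caching: a cached RDD avoids re-running the parse step on every pass over the data, which is what makes Spark attractive for iterative jobs. A plain-Python sketch of that effect, counting parse calls; the CSV lines and the `parse` helper are illustrative, not Spark API.

```python
# Count how many times the expensive parse step runs, with and without a cache
parse_calls = 0
raw = ["1,2", "3,4", "5,6"]

def parse(line):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in line.split(",")]

# Uncached: 3 passes re-parse every line (like recomputing an RDD lineage)
for _ in range(3):
    total = sum(sum(parse(line)) for line in raw)
uncached_calls = parse_calls

# Cached: parse once, reuse on every pass (like rdd.cache() keeping parsed data in memory)
parse_calls = 0
cached = [parse(line) for line in raw]
for _ in range(3):
    total = sum(sum(record) for record in cached)

print(uncached_calls, parse_calls)  # 9 3
```

The saving grows with the number of iterations, which is why machine learning workloads, which loop over the same dataset many times, benefit most.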

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, the file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem        | Spark ecosystem
------------------------|------------------------
Components:             |
HDFS                    | Tachyon
YARN                    | Mesos
Tools:                  |
Pig                     | Spark native API
Hive                    | Spark SQL
Mahout                  | MLlib
Storm                   | Spark Streaming
Giraph                  | GraphX
HUE                     | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
-----------------|-------------------------------------------|-----------------------------------------------
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding to write frameworks
Maturity         | Less mature                               | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                 | Storm                            | Spark Streaming
-----------------------------------------|----------------------------------|---------------------
Processing model                         | Record at a time                 | Mini batches
Latency                                  | Sub-second                       | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates)| Exactly once
Batch framework integration              | Not available                    | Core Spark API
Supported languages                      | Any programming language         | Scala, Java, Python

95
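The processing-model row is the key difference, and it can be illustrated without either system: record-at-a-time processing applies a function per event as it arrives, while Spark Streaming first groups events into small time-based batches (DStreams) and processes each batch with the batch engine. The batch size and the doubling function below are arbitrary choices for the sketch; this is plain Python, not the Storm or Spark API.

```python
events = list(range(10))

# Record-at-a-time (Storm-style): one processing call per event
processed = [e * 2 for e in events]

# Micro-batch (Spark Streaming-style): events grouped into small batches first,
# then each batch is processed as a unit
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
batch_results = [[e * 2 for e in batch] for batch in batches]

flattened = [x for batch in batch_results for x in batch]
print(flattened == processed)  # True: same results, different latency profile
```

Both paths compute the same answer; the trade-off is latency (a record waits for its batch to close) versus the throughput and exactly-once bookkeeping that batching makes easier.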

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


3 Apache Hadoop

• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0

18

4 Apache Spark

• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!

19

5 Key Takeaways

1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.

23

Generation     | Engine     | Processing models
---------------|------------|------------------------------------------
1st generation | MapReduce  | Batch
2nd generation | Tez        | Batch, Interactive
3rd generation | Spark      | Batch, Interactive, Near-Real-Time
4th generation | Flink      | Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• The need: integrate many disparate tools for advanced Big Data analytics, covering queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark' for lightning fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria            | Hadoop MapReduce                         | Tez                                  | Spark
--------------------|------------------------------------------|--------------------------------------|--------------------------------------------------------------------
License             | Open source, Apache 2.0, version 2.x     | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive       | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                     | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing       | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                     | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                                             | Tez                                                  | Spark
-----------------|--------------------------------------------------------------|------------------------------------------------------|--------------------------------------------------
Installation     | Bound to Hadoop                                              | Bound to Hadoop                                      | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same data types and data sources                             | Same data types and data sources                     | Same data types and data sources
YARN integration | YARN application                                             | Ground-up YARN application                           | Spark is moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
------------|----------------------------|----------------------------|----------------------------------------------------------------------
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
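The reuse idea in point 1 can be sketched in plain Python, no Spark required: the same mapper and reducer functions drive both a simulated word-count pipeline and, in the commented line, the equivalent Spark call. The SparkContext `sc` in the comment is assumed, not defined here, and the input lines are made up for the example.

```python
from collections import defaultdict
from itertools import chain

# MapReduce-style functions, written once and reusable from Spark
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

lines = ["spark or hadoop", "spark and hadoop"]

# Spark equivalent (hypothetical, assuming a SparkContext `sc` exists):
# counts = sc.parallelize(lines).flatMap(mapper).reduceByKey(reducer).collect()

# Plain-Python simulation of the same map / shuffle / reduce pipeline:
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = reducer(counts[word], n)

print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The point of the migration path is exactly this: the per-record logic (`mapper`, `reducer`) is engine-agnostic, so only the driver code changes.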

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35
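The engine switch shown above is a session-level setting, so an existing query runs unchanged once the engine is flipped. A minimal HiveQL session might look like the following sketch (the `web_logs` table and query are hypothetical):

```sql
-- Switch this Hive session from MapReduce/Tez to Spark
-- (available once Hive on Spark ships; hypothetical table below).
set hive.execution.engine=spark;

SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page;
```

This is the "no development effort" migration path: the HiveQL itself does not change, only the execution engine underneath it.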

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration

[Diagram: Hadoop ecosystem services and open source tools for each category — storage/serving layer, data formats, data ingestion services, resource management, search, SQL.]

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
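The automatic schema inference described above can be mimicked in a few lines of plain Python. This is a toy illustration only; real Spark SQL infers richer types, handles nested structures, and resolves type conflicts across records:

```python
import json

# Sample JSON dataset, one record per line, as Spark SQL's JSON loader expects.
records = [
    '{"name": "hdfs", "age": 9}',
    '{"name": "spark", "age": 5, "fast": true}',
]

def infer_schema(lines):
    """Union the fields seen across all records and note each field's type."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'fast': 'bool'}
```

Note how the `fast` field, present in only one record, still ends up in the unified schema, which is the behavior that lets heterogeneous JSON files be queried with one set of column names.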

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
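The row-versus-column distinction that makes Parquet attractive can be sketched in plain Python. This is a toy model; real Parquet adds encodings, compression, and file metadata on top of the columnar layout:

```python
# Row-oriented storage keeps whole records together;
# column-oriented storage groups all values of one field.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 7},
]

# Transpose the row layout into a columnar layout.
columns = {key: [r[key] for r in rows] for key in rows[0]}
print(columns)  # {'user': ['a', 'b'], 'clicks': [3, 7]}

# A query touching one field scans only that column, not whole records.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 10
```

This column-at-a-time access pattern is why analytical queries over a few fields of wide tables read far less data from Parquet than from row-oriented formats.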

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

[Diagram: Hadoop ecosystem and Spark ecosystem side by side.]

64

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
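The "Data << RAM" point above, that Spark wins when parsed data fits in memory and can be cached across passes, can be illustrated with a toy cache in plain Python. This is illustrative only; Spark's `RDD.cache()` keeps deserialized partitions in executor memory across a cluster:

```python
parse_calls = 0

raw = ["1,2", "3,4", "5,6"]  # pretend this is a big CSV file on disk

def parse(line):
    """Simulate an expensive parse step and count how often it runs."""
    global parse_calls
    parse_calls += 1
    a, b = line.split(",")
    return int(a) + int(b)

# Without caching: every pass over the data re-parses it (MapReduce-style).
pass1 = [parse(l) for l in raw]
pass2 = [parse(l) for l in raw]
calls_without_cache = parse_calls  # 6: every line parsed twice

# With caching: parse once, then iterate over the in-memory result.
parse_calls = 0
cached = [parse(l) for l in raw]   # analogous to rdd.cache()
pass1 = list(cached)
pass2 = list(cached)
calls_with_cache = parse_calls     # 3: every line parsed once

print(calls_without_cache, calls_with_cache)  # 6 3
```

Iterative algorithms that make many passes over the same dataset multiply this saving, which is exactly the workload where Spark outperforms disk-based engines.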

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution.

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

Hadoop ecosystem       Spark ecosystem
Components:
HDFS                   Tachyon
YARN                   Mesos
Tools:
Pig                    Spark native API
Hive                   Spark SQL
Mahout                 MLlib
Storm                  Spark Streaming
Giraph                 GraphX
HUE                    Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
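The "mix and match SQL with imperative code" idea can be illustrated outside Spark with Python's built-in sqlite3 module standing in for Spark SQL (the `events` table and data below are made up for the sketch):

```python
import sqlite3

# Build a tiny in-memory table (stand-in for a Hive/Parquet/JSON source).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10), ("b", 5), ("a", 7)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code,
# the way a Spark program chains transformations after a SQL query.
top = max(rows, key=lambda r: r[1])
print(rows)  # [('a', 17), ('b', 5)]
print(top)   # ('a', 17)
```

In Spark SQL the same pattern appears as a SQL query returning an RDD (later a DataFrame) that further map/filter/ML steps then consume.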

Spark MLlib

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

93

Spark Streaming

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

94

Storm vs. Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python

95
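The record-at-a-time versus mini-batch contrast in the table can be sketched in plain Python. This is a toy only; real Spark Streaming cuts batches by wall-clock interval (for example every 2 seconds), not by record count:

```python
def micro_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # Spark Streaming processes one batch (an RDD) at a time
            batch = []
    if batch:
        yield batch      # flush the final partial batch

# Storm-style: handle each record individually as it arrives.
per_record = [r * 2 for r in range(7)]

# Spark Streaming-style: handle records one mini batch at a time.
batched = [[r * 2 for r in b] for b in micro_batches(range(7), 3)]

print(per_record)  # [0, 2, 4, 6, 8, 10, 12]
print(batched)     # [[0, 2, 4], [6, 8, 10], [12]]
```

Both paths compute the same results; batching trades per-record latency for higher throughput and easy reuse of the batch API, which is exactly the trade-off in the table.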

GraphX

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

96

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

97

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 19: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4. Apache Spark

• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (machine learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!

19

5. Key Takeaways

1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1 Evolution of Programming APIs

• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml

22
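The gap between hand-written MapReduce and the higher-level APIs listed above can be illustrated with a word count written both ways in plain Python. This is a toy sketch of the programming styles only; it uses no Hadoop or Spark API, and all function names here are made up for illustration.

```python
from collections import defaultdict

# "Assembly-level" style: explicit map, shuffle and reduce phases,
# the way a hand-written MapReduce word count is structured.
def mapper(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)          # group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

lines = ["spark or hadoop", "spark and hadoop"]
mapped = [pair for line in lines for pair in mapper(line)]
counts_mr = dict(reducer(k, v) for k, v in shuffle(mapped).items())

# Higher-level style: the same job expressed directly, closer in
# spirit to Pig, Scalding, Crunch or the Spark API.
counts_hl = defaultdict(int)
for word in (w for line in lines for w in line.split()):
    counts_hl[word] += 1

assert counts_mr == dict(counts_hl)     # same result, far less ceremony
```

The point of the abstractions above (Pig, Scalding, Crunch, …) is exactly this: the three-phase boilerplate disappears while the result stays identical.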

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch, scalability, abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.

24

1 Evolution

• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

26
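The key idea behind RDDs, lazily recorded transformations that only execute when an action is called, can be mimicked in a few lines of plain Python. The MiniRDD class below is a made-up toy for illustration, not Spark's actual RDD API.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily and
    only run when an action such as collect() is called."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []                 # deferred transformation pipeline

    def map(self, fn):                       # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):                  # transformation: returns a new MiniRDD
        return MiniRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):                       # action: runs the whole pipeline
        out = list(self.data)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; collect() triggers execution:
result = rdd.collect()                       # [0, 4, 16]
```

In real Spark the same deferred pipeline is additionally partitioned across a cluster and can be cached in memory, which is where the speed-up comes from.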

1 Evolution Apache Flink

bull Flink German for ldquonimble swift speedyrdquo

bull This is how Apache Flink is branding itself ldquoFast andreliable large-scale data processing enginerdquo

bull Apache Flink httpflinkapacheorg offers

bull Batch and Streaming in the same system

bull Beyond DAGs (Cyclic operator graphs)

bull Powerful expressive APIs

bull Inside-the-system iterations

bull Full Hadoop compatibility

bull Automatic language independent optimizer

bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory and On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility: to data types and data sources is the same | same | same
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria: Hadoop MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark

32
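The "reuse your mapper and reducer" idea can be sketched in plain Python: the legacy functions stay untouched and are simply called from a functional pipeline. This is a toy illustration of the migration pattern, not Spark's actual Java/Scala API, and the record fields here are invented for the example.

```python
from itertools import groupby

# Existing MapReduce-style functions, unchanged from a legacy job:
def mapper(record):
    yield (record["dept"], record["salary"])

def reducer(key, values):
    vals = list(values)
    return key, sum(vals) / len(vals)       # average salary per department

# A Spark-style pipeline can call them directly as ordinary functions:
records = [{"dept": "eng", "salary": 100},
           {"dept": "eng", "salary": 80},
           {"dept": "ops", "salary": 60}]

pairs = sorted(kv for r in records for kv in mapper(r))   # map + sort (shuffle)
averages = dict(
    reducer(k, (v for _, v in grp))                       # reduce per key
    for k, grp in groupby(pairs, key=lambda kv: kv[0])
)
# averages == {"eng": 90.0, "ops": 60.0}
```

The mapper and reducer carry the business logic; only the plumbing around them changes when moving from MapReduce to a Spark-style engine.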

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19

34

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: Open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292

35

Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)

• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12

36

Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

37

Cascading (Expected in 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml

39

Mahout (Expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

40

Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration
[Diagram: Hadoop ecosystem services with an open source tool at each layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator)
• Integration is still improving, and some open issues are critical ones httpsissuesapacheorgjira (JIRA query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka

54
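The way Spark Streaming consumes a Kafka topic, draining messages into small batches at a fixed interval, can be sketched with a plain in-memory queue. This is a toy illustration of the micro-batch idea only; it uses neither Kafka nor Spark, and all names here are invented for the example.

```python
from collections import deque

# Toy message queue standing in for a Kafka topic:
topic = deque(f"event-{i}" for i in range(10))

def next_micro_batch(queue, size):
    """Drain up to `size` messages, the way Spark Streaming forms a
    small batch (an RDD) from the stream every batch interval."""
    batch = []
    while queue and len(batch) < size:
        batch.append(queue.popleft())
    return batch

batches = []
while topic:
    batches.append(next_micro_batch(topic, 4))
# Ten messages drained in micro-batches of at most 4: sizes 4, 4, 2.
```

Each micro-batch would then be processed with the same RDD operations used for batch jobs, which is what makes the streaming and batch APIs in Spark so similar.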

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml

56
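The schema-inference idea, unioning the fields seen across JSON records into one schema, can be illustrated with a minimal stdlib-only pass in plain Python. This is a toy sketch of the concept, not Spark SQL's implementation, and the `infer_schema` helper is made up for this example.

```python
import json

def infer_schema(json_lines):
    """Union the field -> type mapping across all records, roughly the
    way Spark SQL infers a schema from a JSON dataset."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # keep the first type seen for each field
            schema.setdefault(field, type(value).__name__)
    return schema

data = ['{"name": "spark", "stars": 5}',
        '{"name": "hadoop", "batch_only": true}']
schema = infer_schema(data)
# schema == {"name": "str", "stars": "int", "batch_only": "bool"}
```

Spark SQL additionally handles nested structures, type widening across records, and null handling, which is what makes "just point it at JSON files" work on messy real-world data.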

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark demo httpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41

66

4 Complementarity
References:
• Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620

67

4 Complementarity
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

68

4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015 httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40

82

83

• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39

84

• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and Memory
Running tasks     Unix processes               Linux Container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature

90

Spark Native API
• Spark Native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
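The conciseness of the Spark API described above can be illustrated with a plain-Python analogue of the classic RDD word count (a sketch only: no Spark installation is assumed, and `flatMap`/`reduceByKey` are emulated with stdlib tools):

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """Plain-Python analogue of the Spark pipeline:
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)"""
    words = chain.from_iterable(line.split() for line in lines)  # flatMap
    return dict(Counter(words))                                  # map + reduceByKey

counts = word_count(["spark or hadoop", "spark and hadoop"])
# counts == {"spark": 2, "or": 1, "hadoop": 2, "and": 1}
```

The point of the comparison: what takes dozens of lines of mapper/reducer boilerplate in Java MapReduce collapses to a couple of chained transformations in Spark's functional style.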

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
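The "mix and match SQL and imperative code" pattern that Spark SQL enables can be sketched with the stdlib `sqlite3` module standing in for the SQL engine (an analogue only, not Spark SQL itself; table and column names are hypothetical):

```python
import sqlite3

# A declarative SQL aggregation feeds an imperative post-processing step,
# mirroring how Spark SQL results flow into regular RDD/DataFrame code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()
top = {user: total for user, total in rows if total > 4}  # imperative filter
# top == {"ann": 5, "bob": 5}
```

In Spark SQL the same shape appears as a `sql(...)` query whose result is then transformed with ordinary Scala/Java/Python code.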

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                     Storm                     Spark Streaming
Processing model             Record at a time          Mini batches
Latency                      Sub-second                Few seconds
Fault tolerance (every       At least once (may be     Exactly once
record processed)            duplicates)
Batch framework integration  Not available             Core Spark API
Supported languages          Any programming language  Scala, Java, Python
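The mini-batch processing model in the table can be sketched in plain Python: incoming records are grouped into fixed-size batches and each batch is processed as one unit (a conceptual sketch, no Spark assumed; the helper name is hypothetical):

```python
def micro_batches(stream, batch_size):
    """Group a record stream into mini-batches, the processing model
    Spark Streaming uses (vs. Storm's record-at-a-time model)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what gives Spark Streaming exactly-once semantics and batch-API reuse, at the cost of a few seconds of latency; Storm's per-record model gets sub-second latency but weaker delivery guarantees.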

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.

20

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

21

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

23

• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.

1 Evolution

• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1 Evolution

• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             MapReduce                Tez                      Spark
License              Open Source Apache 2.0,  Open Source Apache 2.0,  Open Source Apache 2.0,
                     version 2.x              version 0.x              version 1.x
Processing model     On-disk (disk-based      On-disk; Batch,          In-memory and on-disk;
                     parallelization); Batch  Interactive              Batch, Interactive,
                                                                       Streaming (near real-time)
Language written in  Java                     Java                     Scala
API                  [Java, Python, Scala];   Java [ISV/Engine/Tool    [Scala, Java, Python];
                     user-facing              builder]                 user-facing
Libraries            None; separate tools     None                     [Spark Core, Spark Streaming,
                                                                       Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                     Tez                      Spark
Installation      Bound to Hadoop               Bound to Hadoop          Isn't bound to Hadoop
Ease of use       Difficult to program, needs   Difficult to program;    Easy to program, no need
                  abstractions; no interactive  no interactive mode      of abstractions;
                  mode except Hive, Pig         except Hive, Pig         interactive mode
Compatibility     Same data types and           Same data types and      Same data types and
                  data sources                  data sources             data sources
YARN integration  YARN application              Ground-up YARN           Spark is moving
                                                application              towards YARN

Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce            Tez                  Spark
Deployment   YARN                 YARN                 [Standalone, YARN, SIMR, Mesos, …]
Performance  -                    -                    Good performance when data fits into
                                                      memory; performance degradation otherwise
Security     More features and    More features and    Still in its infancy; partial support
             projects             projects

III. Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
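The reuse pattern in point 1 can be illustrated with a plain-Python sketch (no Hadoop or Spark assumed; the function names are hypothetical): the same mapper and reducer pair drives an explicit MapReduce-style map/shuffle/reduce pipeline, and those same functions could equally be called from Spark transformations.

```python
from collections import defaultdict

# The same mapper/reducer pair you would keep when migrating to Spark.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    return key, sum(values)

def mapreduce_style(lines):
    """Explicit map -> shuffle -> reduce phases, as in Hadoop MR."""
    shuffled = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):       # map phase
            shuffled[key].append(value)        # shuffle / group by key
    return dict(reducer(k, v) for k, v in shuffled.items())  # reduce phase

result = mapreduce_style(["spark and hadoop", "spark"])
# result == {"spark": 2, "and": 1, "hadoop": 1}
```

In Spark, the shuffle bookkeeping disappears: the same `mapper` feeds a `flatMap` and the same `reducer` logic becomes a `reduceByKey`.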

32

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3. Integration
Service categories integrated with Spark (each shown with its open source tools on the original slide):
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving (open SPARK JIRA issues, search: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
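The shape of that integration - a receiver pulls records off a message queue and hands the streaming engine one batch per interval - can be sketched with the stdlib `queue` module standing in for a Kafka topic (an analogue only; no Kafka or Spark is assumed, and the helper name is hypothetical):

```python
import queue

def drain(topic_queue, max_records):
    """Pull up to max_records off the 'topic' without blocking,
    the way a streaming receiver fills one batch interval."""
    batch = []
    while len(batch) < max_records:
        try:
            batch.append(topic_queue.get_nowait())
        except queue.Empty:
            break
    return batch

topic = queue.Queue()
for msg in ["click:1", "click:2", "view:1"]:
    topic.put(msg)

batch = drain(topic, max_records=2)
# "view:1" stays queued for the next batch interval
```

In the real integration, the Kafka consumer offset plays the role of the queue position, which is what lets Spark Streaming replay a batch after a failure.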

54

3 Integration

• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
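The schema-inference idea can be sketched in plain Python: scan every JSON record, union the fields seen, and record each field's value types (a toy analogue of what Spark SQL does, not its actual implementation):

```python
import json

def infer_schema(json_lines):
    """Toy version of Spark SQL's JSON schema inference:
    scan every record and union field -> type-name mappings."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

records = ['{"user": "ann", "age": 34}', '{"user": "bob", "vip": true}']
schema = infer_schema(records)
# schema == {"user": ["str"], "age": ["int"], "vip": ["bool"]}
```

This is why "no more DDL" holds: the schema falls out of the data itself, including fields that appear in only some records.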

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
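The core idea behind Parquet's columnar layout can be shown with a minimal plain-Python transposition (a conceptual sketch only; it assumes every row has the same fields and ignores Parquet's encoding and compression):

```python
def to_columnar(rows):
    """Transpose row-oriented records into a column store,
    the layout idea behind Parquet. Assumes uniform rows."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

rows = [{"user": "ann", "clicks": 3}, {"user": "bob", "clicks": 5}]
cols = to_columnar(rows)
# A query touching only "clicks" now reads one contiguous list:
total = sum(cols["clicks"])
```

This is why analytical queries over a few columns are much cheaper on Parquet than on row-oriented formats: only the needed columns are read, and same-typed values compress well.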

57

3 Integration

• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

58

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem / Spark ecosystem

4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

1. File System

"When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives." From: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
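In practice, the choice among these deployments mostly comes down to the master URL handed to spark-submit. The commands below are a hedged sketch only: host names, ports and the application file are placeholders, and option spellings can vary by Spark version.

```shell
# Local mode: a single machine, 4 worker threads
spark-submit --master "local[4]" my_app.py

# Standalone cluster manager shipped with Spark
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos: no Hadoop components required
spark-submit --master mesos://mesos-master:5050 my_app.py

# Hadoop YARN: the only option here that assumes a Hadoop cluster
spark-submit --master yarn-cluster my_app.py
```

The same application code runs unchanged under each master; only the deployment target differs.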

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

3. Distributions

• Using Spark on a non-Hadoop distribution:

Cloud:
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE:
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
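To make the memory-centric idea concrete, here is a minimal, hedged sketch in plain Python (this is not Tachyon's API; all names are invented): a toy store that serves repeated reads from RAM and only falls back to disk after a spill, which is the trade-off a memory-centric layer makes.

```python
import os
import tempfile

class MemoryFirstStore:
    """Toy illustration of a memory-centric file layer (not Tachyon's API).

    Writes land in an in-memory dict; a disk directory is used only as a
    spill target, mimicking how a memory-centric layer keeps the hot path
    shared by multiple frameworks off the disk.
    """

    def __init__(self, spill_dir):
        self._mem = {}          # filename -> bytes: the "memory tier"
        self._spill_dir = spill_dir

    def write(self, name, data):
        self._mem[name] = data  # land in memory first

    def spill(self, name):
        # Evict one file to disk, e.g. under memory pressure
        path = os.path.join(self._spill_dir, name)
        with open(path, "wb") as f:
            f.write(self._mem.pop(name))

    def read(self, name):
        if name in self._mem:   # memory-speed hit
            return self._mem[name]
        with open(os.path.join(self._spill_dir, name), "rb") as f:
            return f.read()     # slower disk fallback

store = MemoryFirstStore(tempfile.mkdtemp())
store.write("partition-0", b"shared dataset bytes")
hot = store.read("partition-0")    # served from memory
store.spill("partition-0")
cold = store.read("partition-0")   # served from the disk fallback
```

Both reads return the same bytes; only the tier they come from differs, which is exactly the throughput argument made for Tachyon over disk-based storage.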

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

Spark Native API
• Spark Native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
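The "mix SQL with imperative code" idea can be illustrated without a Spark cluster. The sketch below uses Python's built-in sqlite3 in place of Spark SQL (table name, schema and data are invented for the example); the point is the hand-off between a declarative query and ordinary program logic.

```python
import sqlite3

# Invented toy data standing in for a table registered with the SQL engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the query result with ordinary code,
# the way a Spark job keeps transforming a result set programmatically
top_users = [user for user, total in rows if total > 5]
print(top_users)  # ['ann', 'bob']
```

Spark SQL applies the same pattern at cluster scale: SQL where it is most natural, a programmatic API where it is not.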

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
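The record-at-a-time vs mini-batch distinction in the table can be sketched in a few lines of plain Python (illustrative only; no Storm or Spark Streaming APIs are involved):

```python
from itertools import islice

def record_at_a_time(events, handle):
    # Storm-style: each record is handed to the topology as it arrives
    for e in events:
        handle(e)

def mini_batches(events, batch_size):
    # Spark Streaming-style: the stream is chopped into small batches,
    # and each batch is then processed with the ordinary batch API
    while True:
        batch = list(islice(events, batch_size))
        if not batch:
            return
        yield batch

seen = []
record_at_a_time(iter(range(3)), seen.append)

stream = iter(range(10))  # stand-in for an unbounded event stream
batches = list(mini_batches(stream, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Batching is what gives Spark Streaming its few-seconds latency floor, and also what lets it reuse the core batch API unchanged.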

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

1. Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

(Diagram: four generations of compute models. The 1st generation supports batch only; the 2nd adds interactive; the 3rd adds near-real-time; the 4th adds real-time and iterative processing.)

1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs); …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.

1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

Hadoop MapReduce vs Tez vs Spark

Criteria          Hadoop MapReduce        Tez                     Spark
License           Open Source             Open Source             Open Source
                  Apache 2.0, v2.x        Apache 2.0, v0.x        Apache 2.0, v1.x
Processing model  On-disk (disk-based     On-disk; Batch,         In-memory and on-disk;
                  parallelization);       Interactive             Batch, Interactive,
                  Batch                                           Streaming (near real-time)
Written in        Java                    Java                    Scala
API               [Java, Python, Scala],  Java [ISV/Engine/Tool   [Scala, Java, Python],
                  user-facing             builder]                user-facing
Libraries         None; separate tools    None                    [Spark Core, Spark Streaming,
                                                                  Spark SQL, MLlib, GraphX]

Hadoop MapReduce vs Tez vs Spark (continued)

Criteria          Hadoop MapReduce        Tez                     Spark
Installation      Bound to Hadoop         Bound to Hadoop         Isn't bound to Hadoop
Ease of use       Difficult to program,   Difficult to program;   Easy to program, no need
                  needs abstractions;     no interactive mode     of abstractions;
                  no interactive mode     except Hive, Pig        interactive mode
                  except Hive, Pig
Compatibility     Same, to data types     Same, to data types     Same, to data types
                  and data sources        and data sources        and data sources
YARN integration  YARN application        Ground-up YARN          Spark is moving
                                          application             towards YARN

Hadoop MapReduce vs Tez vs Spark (continued)

Criteria     Hadoop MapReduce      Tez                  Spark
Deployment   YARN                  YARN                 [Standalone, YARN*, SIMR, Mesos, …]
Performance  -                     -                    Good performance when data fits into
                                                        memory; performance degradation otherwise
Security     More features and     More features and    Still in its infancy
             projects              projects

(* Partial support)

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
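The reuse-your-mapper-and-reducer idea can be shown without a cluster. This pure-Python sketch (invented helper names; no Hadoop or Spark APIs) keeps the two MapReduce-style functions unchanged and calls them from a chained, Spark-like pipeline:

```python
from collections import defaultdict
from itertools import chain

# Mapper and reducer as they might be written for a MapReduce job...
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

# ...reused unchanged in a chained pipeline, conceptually
# flatMap(mapper) -> groupByKey -> map(reducer)
def word_count(lines):
    pairs = chain.from_iterable(mapper(line) for line in lines)
    groups = defaultdict(list)       # stand-in for the shuffle/groupByKey
    for word, one in pairs:
        groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

counts = word_count(["spark and hadoop", "spark or hadoop"])
print(counts)
```

The business logic (mapper/reducer) survives the migration; only the driver that wires the stages together changes.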

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

3. Integration

(Diagram: open source tools and services from the Hadoop ecosystem that integrate with Spark, grouped by layer: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
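Kafka's core idea, decoupling producers from stream consumers through a durable topic, can be mimicked with Python's standard library. The toy below is an in-process queue, not the Kafka protocol or any Spark Streaming API; all names are invented:

```python
import queue
import threading

topic = queue.Queue()    # stand-in for a Kafka topic/partition
SENTINEL = object()      # end-of-stream marker, for the toy example only

def producer():
    # An ingestion process publishing events, unaware of any consumer
    for i in range(5):
        topic.put({"event_id": i})
    topic.put(SENTINEL)

received = []

def consumer():
    # A streaming job pulling from the topic at its own pace
    while True:
        msg = topic.get()
        if msg is SENTINEL:
            break
        received.append(msg["event_id"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [0, 1, 2, 3, 4]
```

The same decoupling is what lets Spark Streaming consume a Kafka topic written by producers that know nothing about Spark.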

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
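Schema inference itself is easy to illustrate with the standard library. The sketch below is plain Python, not Spark SQL's actual algorithm: it scans JSON records and derives a field-to-type mapping, which is the essence of "no more DDL".

```python
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 28, "city": "LA"}',
]

def infer_schema(json_lines):
    # Union the fields seen across records and record each field's type,
    # roughly what "automatic schema inference" means for a JSON dataset
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

schema = infer_schema(records)
print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that a field like "city" that appears in only some records still ends up in the schema, just as a missing field becomes a null in the loaded table.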

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
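Why a columnar layout helps analytics can be shown in a few lines of plain Python. This is a toy row store vs column store; nothing here is the Parquet format itself, and the table contents are invented:

```python
# The same table, laid out row-wise and column-wise
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
    {"user": "cat", "clicks": 5, "country": "US"},
]

columns = {
    "user":    ["ann", "bob", "cat"],
    "clicks":  [3, 7, 5],
    "country": ["US", "FR", "US"],
}

# Row layout: an aggregate over one field still touches every record
row_total = sum(r["clicks"] for r in rows)

# Column layout: the query reads only the one column it needs,
# which is what makes columnar formats like Parquet scan-efficient
# (and lets them compress each column independently)
col_total = sum(columns["clicks"])

print(row_total, col_total)  # 15 15
```

Both layouts give the same answer; the columnar one simply avoids reading the "user" and "country" data for a clicks-only query.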

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

(Diagram: Hadoop ecosystem and Spark ecosystem components)

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
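In practice, the deployment modes above map onto the `--master` URL passed to `spark-submit`. The host names, ports, and `app.py` below are placeholders, not values from this talk; this is a minimal sketch of the standard Spark 1.x submission syntax.

```shell
# Local mode: run the driver and executors in one JVM, with 4 worker threads
spark-submit --master "local[4]" app.py

# Standalone cluster mode (hypothetical master host)
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos (hypothetical Mesos master)
spark-submit --master mesos://mesos-host:5050 app.py

# YARN (requires a Hadoop cluster; configuration comes from HADOOP_CONF_DIR)
spark-submit --master yarn-client app.py
```

Only the YARN mode ties Spark to a Hadoop cluster; the others need no Hadoop installation at all, which is the point of this section.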

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives
Hadoop ecosystem | Spark ecosystem
Components:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos
Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and Memory
• Running tasks: Unix processes | Linux Container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature

90

Spark Native API
• Spark Native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python

95
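The "mini batches" row above can be illustrated without Spark at all. The sketch below is a plain-Python analogue (the function names are mine, not Spark's API): a continuous stream is chopped into small fixed-size batches, and each batch is then processed with ordinary batch logic, which is why Spark Streaming's latency is a few seconds rather than sub-second.

```python
from collections import defaultdict

def mini_batches(stream, batch_size):
    """Group a continuous stream of records into fixed-size mini batches,
    the way Spark Streaming discretizes a stream into per-interval batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly partial, batch

def count_per_batch(stream, batch_size):
    """Process each mini batch with regular batch logic (here: counting)."""
    results = []
    for batch in mini_batches(stream, batch_size):
        counts = defaultdict(int)
        for word in batch:
            counts[word] += 1
        results.append(dict(counts))
    return results

events = ["error", "ok", "ok", "error", "ok"]
print(count_per_batch(events, 2))
```

A record-at-a-time engine like Storm would instead hand each event to the processing logic immediately, trading throughput for latency.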

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html

22

1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

23

(Diagram: four generations of compute models)
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real Time
• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, Scalability, Abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark
Criteria: Hadoop MapReduce | Tez | Spark
• License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
• Processing model: On-disk (disk-based parallelization), Batch | On-disk, Batch, Interactive | In-memory and on-disk, Batch, Interactive, Streaming (near real-time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark
Criteria: Hadoop MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
• Compatibility: Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark
Criteria: Hadoop MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
• Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy

30

Partial support

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
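The flavor of such a translation can be shown without a Spark installation. The sketch below is a plain-Python stand-in (TinyRDD is illustrative, not Spark's API): the classic multi-step MapReduce word count collapses into a short chain of Spark-style transformations, flatMap, then map, then reduceByKey.

```python
class TinyRDD:
    """Minimal stand-in for a Spark RDD, just enough to mirror the
    flatMap -> map -> reduceByKey shape of the classic word count."""

    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # One input item can produce many output items (e.g. line -> words).
        return TinyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return TinyRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        # Combine all values sharing a key, like the MapReduce reduce phase.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return TinyRDD(acc.items())

    def collect(self):
        return list(self.data)

lines = TinyRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
          .collect())
print(sorted(counts))  # -> [('and', 1), ('hadoop', 2), ('or', 1), ('spark', 2)]
```

In real Spark the same chain runs verbatim on a SparkContext RDD, which is why hand-written mapper and reducer functions usually carry over with little change.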

2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration
(Diagram: Hadoop ecosystem services and the open source tools that integrate with Spark, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)

43

3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving; open Spark YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
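The native Kafka integration above can be tried with the example bundled in the Spark distribution (ZooKeeper address, consumer group, and topic name are illustrative assumptions):

```shell
# Run Spark's bundled Kafka word-count example against a local broker.
# ZooKeeper quorum, group, topic, and thread count are assumptions.
bin/run-example org.apache.spark.examples.streaming.KafkaWordCount \
  localhost:2181 my-consumer-group test-topic 1
```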

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
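The "no DDL, just point and query" flow above can be sketched as follows (Spark 1.2-era SchemaRDD API; file path and fields are illustrative assumptions):

```shell
# Create a tiny JSON file, let Spark SQL infer its schema, then query it.
# File name and record fields are assumptions.
echo '{"name":"Slim","age":40}' > people.json
spark-shell <<'EOF'
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
val people = sqlCtx.jsonFile("people.json")
people.printSchema()                 // schema inferred automatically, no DDL
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
EOF
```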

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files

• Run SQL queries over imported data

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
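The three Parquet capabilities listed above fit in one round-trip sketch (Spark 1.2-era API; the path and case class are illustrative assumptions):

```shell
# Write an RDD out as Parquet, read it back, and query it with SQL.
# Path and the Talk case class are assumptions.
spark-shell <<'EOF'
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx.createSchemaRDD
case class Talk(title: String, year: Int)
val talks = sc.parallelize(Seq(Talk("Spark or Hadoop", 2015)))
talks.saveAsParquetFile("talks.parquet")
sqlCtx.parquetFile("talks.parquet").registerTempTable("talks")
sqlCtx.sql("SELECT title FROM talks WHERE year = 2015").collect().foreach(println)
EOF
```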

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL; this library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
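A hedged sketch of the elasticsearch-hadoop RDD integration above (the jar version, index/type name, and document fields are illustrative assumptions):

```shell
# Put the elasticsearch-hadoop connector on the Spark shell classpath,
# then save a small RDD to an Elasticsearch index as documents.
# Jar version and the 'talks/docs' index/type are assumptions.
spark-shell --jars elasticsearch-hadoop-2.1.0.jar <<'EOF'
import org.elasticsearch.spark._
sc.makeRDD(Seq(Map("speaker" -> "Slim", "topic" -> "Spark"))).saveToEs("talks/docs")
EOF
```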

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark

• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
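As a hedged sketch of option 4 above (reading directly from Amazon S3 with no HDFS; the bucket, key, and credential placeholders are illustrative assumptions):

```shell
# Read a file straight from S3 in the Spark shell via the s3n:// scheme.
# Bucket name, object path, and credentials are assumptions.
spark-shell <<'EOF'
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
println(sc.textFile("s3n://my-bucket/path/data.txt").count())
EOF
```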

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
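The cluster-manager agnosticism above shows up directly in `spark-submit`: only the `--master` URL changes between deployments (host names and the application jar are illustrative assumptions):

```shell
# The same application submitted to four different cluster managers.
# Host names, ports, class, and jar name are assumptions.
spark-submit --master local[4]             --class MyApp my-app.jar   # local
spark-submit --master spark://master:7077  --class MyApp my-app.jar   # standalone
spark-submit --master mesos://master:5050  --class MyApp my-app.jar   # Mesos
spark-submit --master yarn-cluster         --class MyApp my-app.jar   # YARN
```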

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

              Hadoop ecosystem   Spark ecosystem

Components:   HDFS               Tachyon
              YARN               Mesos

Tools:        Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                        Mesos

Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                   Storm                       Spark Streaming

Processing model           Record at a time            Mini batches
Latency                    Sub-second                  Few seconds
Fault tolerance (every     At least once (may be       Exactly once
record processed)          duplicates)
Batch framework            Not available               Core Spark API
integration
Supported languages        Any programming language    Scala, Java, Python

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


1 Evolution of Compute Models

When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

• 1st generation: Batch

• 2nd generation: Batch, Interactive

• 3rd generation: Batch, Interactive, Near-Real time

• 4th generation: Batch, Interactive, Real-Time, Iterative

1 Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.

24

1 Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."

Source: http://tez.apache.org

• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1 Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".

• Apache Flink (http://flink.apache.org) offers:

• Batch and streaming in the same system

• Beyond DAGs (cyclic operator graphs)

• Powerful, expressive APIs

• Inside-the-system iterations

• Full Hadoop compatibility

• Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria             Hadoop MapReduce          Tez                       Spark

License              Open Source Apache 2.0,   Open Source Apache 2.0,   Open Source Apache 2.0,
                     version 2.x               version 0.x               version 1.x
Processing model     On-disk (disk-based       On-disk; batch,           In-memory and on-disk;
                     parallelization); batch   interactive               batch, interactive, streaming
                                                                         (near real-time)
Language written in  Java                      Java                      Scala
API                  [Java, Python, Scala];    Java [ISV/engine/tool     [Scala, Java, Python];
                     user-facing               builder]                  user-facing
Libraries            None; separate tools      None                      [Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria          Hadoop MapReduce            Tez                        Spark

Installation      Bound to Hadoop             Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program,       Difficult to program;      Easy to program, no need
                  needs abstractions; no      no interactive mode        of abstractions;
                  interactive mode except     except Hive, Pig           interactive mode
                  Hive, Pig
Compatibility     Same data types and         Same data types and        Same data types and
                  data sources                data sources               data sources
YARN integration  YARN application            Ground-up YARN             Spark is moving
                                              application                towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria      Hadoop MapReduce        Tez                     Spark

Deployment    YARN                    YARN                    [Standalone, YARN, SIMR, Mesos, …]
Performance   -                       -                       Good performance when data fits into
                                                              memory; performance degradation
                                                              otherwise
Security      More features and       More features and       Still in its infancy; partial support
              projects                projects

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32

2 Transition

3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34
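The "-x spark" migration path above amounts to changing one flag (the script name is an illustrative assumption):

```shell
# The same Pig script, first on the classic engine, then on Spark (Spork).
# Script name is an assumption.
pig -x mapreduce wordcount.pig   # classic execution on MapReduce
pig -x spark     wordcount.pig   # same script on the Spark engine
```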

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35
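In a full session, the engine switch above is just one `set` before the query (the `weblogs` table is an illustrative assumption):

```shell
# Switch one Hive session to the Spark engine and run a multi-stage query.
# The weblogs table is an assumption.
hive <<'EOF'
set hive.execution.engine=spark;
SELECT page, COUNT(*) FROM weblogs GROUP BY page;
EOF
```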

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

(Expected in Cascading 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration: Services and Open Source Tools

• Storage/Serving layer

• Data formats

• Data ingestion services

• Resource management

• Search

• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44
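As a quick illustration of this storage-agnostic design, a minimal PySpark sketch (Spark 1.x-era API; the host names, bucket and paths below are hypothetical, and this needs a Spark installation, so it is shown as an illustration rather than a runnable script):

```python
# Sketch: the same RDD API reads and writes across storage back-ends.
# All URIs below are hypothetical; swap in your own cluster's paths.
from pyspark import SparkContext

sc = SparkContext("local", "storage-demo")

local_rdd = sc.textFile("file:///tmp/events.log")       # local file system
hdfs_rdd = sc.textFile("hdfs://namenode:8020/logs/")    # HDFS
s3_rdd = sc.textFile("s3n://my-bucket/logs/")           # Amazon S3 (s3n scheme in Spark 1.x)

# Writing is just as uniform:
local_rdd.saveAsTextFile("hdfs://namenode:8020/out/")
```

The point is that only the URI scheme changes; the transformations and actions on the resulting RDDs are identical regardless of the backing store.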

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45
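A hedged sketch of the newAPIHadoopRDD route from Python, modeled on the hbase_inputformat example shipped with Spark (the ZooKeeper host and table name are hypothetical, and it assumes the Spark examples jar with the Python converters is on the classpath):

```python
# Sketch: reading an HBase table as an RDD via Hadoop's TableInputFormat.
# "zkhost" and "my_table" are placeholders; requires a live HBase cluster.
conf = {"hbase.zookeeper.quorum": "zkhost",
        "hbase.mapreduce.inputtable": "my_table"}

rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)

print(rdd.count())  # number of rows scanned from the HBase table
```

The Scala HBaseTest.scala example cited above does the same thing with the native API and no converters.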

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
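A minimal sketch of the three bullets above using the Spark 1.2-era HiveContext API (the table and column names are hypothetical, and this requires a Spark deployment built with Hive support, so it is an illustration rather than a runnable script):

```python
# Sketch: querying and writing Hive tables from Spark SQL (Spark 1.2 era).
# Table/column names ("people", "adults") are placeholders.
from pyspark.sql import HiveContext

hc = HiveContext(sc)  # sc is an existing SparkContext

# Import relational data from an existing Hive table and query it:
rows = hc.sql("SELECT name, age FROM people WHERE age > 21")

# Write the results back out to another Hive table:
hc.sql("CREATE TABLE IF NOT EXISTS adults (name STRING, age INT)")
rows.insertInto("adults")
```

The same SchemaRDD returned by hc.sql can also feed MLlib, which is the Hive-for-feature-fetching pattern mentioned in the last bullet.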

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
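A hedged sketch of the receiver-based Kafka integration described in the guide above (ZooKeeper host, consumer group and topic name are hypothetical; this needs a running Kafka broker and a Spark deployment, so it is an illustration rather than a runnable script):

```python
# Sketch: consuming a Kafka topic with Spark Streaming micro-batches.
# "zkhost:2181", "demo-group" and "events" are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-demo")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Receiver-based approach: subscribe to the "events" topic via ZooKeeper.
stream = KafkaUtils.createStream(ssc, "zkhost:2181", "demo-group", {"events": 1})

# Each element is a (key, message) pair; count messages per batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```

The mini-batch model here is the same one contrasted with Storm's record-at-a-time model later in this deck.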

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
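The idea behind schema inference can be shown without a cluster: scan JSON records and derive a field-to-type mapping, which is conceptually what Spark SQL's jsonFile/jsonRDD do at scale. A runnable pure-Python sketch (the sqlContext lines in the comment are the actual Spark 1.2-era calls):

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON records,
    mimicking what Spark SQL's jsonRDD does across a cluster."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema[key] = type(value).__name__
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "age": 30, "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}

# In Spark 1.2 this is simply:
#   people = sqlContext.jsonRDD(sc.parallelize(records))
#   people.printSchema(); people.registerTempTable("people")
```

Note how the second record widens the schema with a new "city" field; Spark SQL merges schemas across records the same way.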

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
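A minimal sketch of the Parquet round trip using the Spark 1.2-era SQLContext API (paths and column names are hypothetical; this needs a Spark deployment, so it is an illustration rather than a runnable script):

```python
# Sketch: JSON in, Parquet out, SQL over the result (Spark 1.2 era).
# All paths and the "people" schema are placeholders.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is an existing SparkContext

people = sqlContext.jsonFile("hdfs:///data/people.json")
people.saveAsParquetFile("hdfs:///data/people.parquet")  # write out as Parquet

# Read it back and run SQL over it:
parquet_people = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquet_people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```

Because Parquet files carry their own schema, the read-back side needs no DDL, mirroring the JSON story on the previous slide.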

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice

• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem + Spark ecosystem

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

Hadoop Ecosystem | Spark Ecosystem

Components:
HDFS | Tachyon
YARN | Mesos

Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
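The conciseness point can be shown in miniature with runnable Python (the sc.parallelize line in the comment is the actual Spark shell equivalent): the same transformation written as a verbose named callback, the style forced on pre-Java-8 Spark code, versus an inline lambda.

```python
# Verbose, pre-lambda style (roughly what old Java 7 Spark code looked like):
class SquareFunction:
    def call(self, x):
        return x * x

nums = [1, 2, 3, 4]
squares_verbose = [SquareFunction().call(x) for x in nums]

# Concise lambda style (Java 8 / Scala / Python Spark APIs):
squares_lambda = list(map(lambda x: x * x, nums))

print(squares_verbose == squares_lambda)  # True
# In the Spark shell this is: sc.parallelize(nums).map(lambda x: x * x).collect()
```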

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


1. Evolution

• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org

• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…

• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.

• The need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.

24

1. Evolution

• Tez: Hindi for "speed".

• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org

• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

25

1. Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same with respect to data types and data sources | Same with respect to data types and data sources | Same with respect to data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)

30

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
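The reuse point can be illustrated with word count: the mapper and reducer below are ordinary functions in the MapReduce style, driven here by plain Python so the sketch runs anywhere; in Spark the very same mapper plugs into flatMap, as shown in the comment (a hypothetical migration, not code from the Cloudera how-to above).

```python
from itertools import groupby
from operator import itemgetter

# MapReduce-style functions, reusable as-is:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

def run_locally(lines):
    """Drive the mapper/reducer with plain Python, simulating the framework's
    shuffle (sort + group by key) between the map and reduce phases."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(w, [c for _, c in group])
                for w, group in groupby(pairs, key=itemgetter(0)))

print(run_locally(["spark and hadoop", "spark or hadoop"]))
# {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}

# The same mapper in Spark (hypothetical migration):
#   sc.textFile(path).flatMap(mapper).reduceByKey(lambda a, b: a + b)
```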

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark: Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (Expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – "Goodbye MapReduce": Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout scala and spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) – MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
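The co-occurrence idea behind the recommendation talk referenced above can be sketched without Mahout or Spark: count how often items appear together across user histories, and use the strongest pairs as a recommendation signal. This is an illustrative toy (the names and data are made up, not Mahout's API):

```python
from collections import Counter
from itertools import combinations

# Each row is one user's interaction history (toy data)
histories = [
    ["beer", "chips", "salsa"],
    ["beer", "chips"],
    ["salsa", "chips"],
]

# Count item pairs that co-occur within a history (order-independent)
cooccur = Counter()
for items in histories:
    for a, b in combinations(sorted(set(items)), 2):
        cooccur[(a, b)] += 1

# Frequently co-occurring pairs drive "people who liked X also liked Y"
print(cooccur.most_common(2))
```

Mahout's Spark-based recommender computes essentially this matrix, but distributed and with statistical weighting (e.g. log-likelihood ratio) instead of raw counts.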

41

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration
(Diagram: Hadoop ecosystem services mapped to open source tools – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)

43

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (Status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files

48

3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
• Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving, and some open issues are critical ones (see the SPARK project in Apache JIRA for open issues mentioning YARN)
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib

52

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: "Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
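The schema-inference idea can be sketched without a Spark cluster: scan JSON records and union the fields and types seen across them. This toy uses Python's standard `json` module as a stand-in; Spark SQL does the same thing distributed, with much richer type reconciliation.

```python
import json

# Two JSON records with overlapping but not identical fields
records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "city": "LA"}',
]

# Union of fields across records, with the Python type observed for each —
# a toy version of Spark SQL's automatic JSON schema inference
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

This is why no DDL is needed: the schema is derived from the data itself, and fields missing from a record simply become nullable columns.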

56

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
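Why a columnar format helps: analytical queries usually touch a few columns, and storing values column-by-column lets a scan read only those. A minimal, Parquet-free sketch of the row-to-column pivot (plain Python, toy data):

```python
# Row-oriented records, as they might arrive from an application
rows = [
    {"id": 1, "name": "ann", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.0},
]

# Pivot to a columnar layout: one contiguous list per column,
# so a query over "score" never touches the "name" values at all
columns = {field: [row[field] for row in rows] for field in rows[0]}

avg_score = sum(columns["score"]) / len(columns["score"])
print(columns["score"], avg_score)  # [9.5, 7.0] 8.25
```

Parquet adds per-column encoding and compression on top of this layout, which is what makes column scans so much cheaper than full-row reads.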

57

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark" (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos
References:
• "Apache Mesos vs Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security

68

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• "Matt Schumpert on Datameer Smart Execution Engine": http://www.infoq.com/articles/datameer-smart-execution-engine – interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• "Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component:
            HDFS                Tachyon
            YARN                Mesos
Tools:
            Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                             Mesos
Resource sharing   Yes                              Yes
Written in         Java                             C++
Scheduling         Memory only                      CPU and Memory
Running tasks      Unix processes                   Linux Container groups
Requests           Specific requests and            More generic, but more coding
                   locality preference              for writing frameworks
Maturity           Less mature                      Relatively more mature

90

Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
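The "mix SQL and programmatic code" pattern can be illustrated without a Spark cluster using Python's built-in sqlite3: register an ordinary function as a SQL UDF, then call it from a query. Spark SQL offers the same pattern on much larger data (the table and function names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("ann", 3), ("bob", 5)])

# Register ordinary code as a SQL UDF, then mix it into a declarative query —
# the same SQL-plus-programmatic-API style Spark SQL provides at scale
con.create_function("double_clicks", 1, lambda n: n * 2)
result = con.execute(
    "SELECT user, double_clicks(clicks) FROM events ORDER BY user").fetchall()

print(result)  # [('ann', 6), ('bob', 10)]
```

The design point is the same in both systems: declarative SQL handles the relational part, while arbitrary host-language code handles logic SQL cannot express.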

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                  Storm                       Spark Streaming
Processing model          Record at a time            Mini batches
Latency                   Sub-second                  Few seconds
Fault tolerance – every   At least once (may be       Exactly once
record processed          duplicates)
Batch framework           Not available               Core Spark API
integration
Supported languages       Any programming language    Scala, Java, Python
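The "mini batches" model in the table can be sketched in a few lines: instead of handling each record on arrival (Storm's model), buffer the stream into small batches and run ordinary batch code on each one, which is essentially what Spark Streaming's DStreams do on a time interval. This toy batches by count rather than time:

```python
def micro_batches(stream, batch_size):
    """Group an event stream into fixed-size mini batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:           # flush the final partial batch
        yield batch

# Each batch can now be processed with normal batch logic —
# the same code used for offline jobs — at the cost of a little latency
batches = list(micro_batches(range(5), batch_size=2))
print(batches)  # [[0, 1], [2, 3], [4]]
```

This trade-off is exactly the Latency row above: batching adds seconds of delay, but buys batch-API reuse and exactly-once semantics.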

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop

25

1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
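A toy model of the RDD idea (illustrative names only, not Spark's implementation): transformations are recorded lazily and only run when an action asks for results, and caching keeps a computed result in memory for reuse across actions (real Spark also caches lazily, on the first action; this toy caches eagerly).

```python
class ToyRDD:
    """Tiny stand-in for an RDD: lazy transformations, explicit caching."""
    def __init__(self, compute):
        self._compute = compute        # deferred function producing the data
        self._cached = None

    def map(self, f):                  # transformation: nothing runs yet
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cached = self.collect()  # materialize once, keep in memory
        return self

    def collect(self):                 # action: triggers computation
        return self._cached if self._cached is not None else self._compute()

numbers = ToyRDD(lambda: [1, 2, 3]).map(lambda x: x * 10).cache()
print(numbers.collect())  # [10, 20, 30]
```

Repeated actions on `numbers` now reuse the in-memory result instead of recomputing the chain, which is the "rapid in-memory processing" the slide refers to.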

26

1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria            | Hadoop MapReduce         | Tez                      | Spark
--------------------|--------------------------|--------------------------|------------------------------
License             | Open Source Apache 2.0,  | Open Source Apache 2.0,  | Open Source Apache 2.0,
                    | version 2.x              | version 0.x              | version 1.x
Processing Model    | On-disk (disk-based      | On-disk; Batch,          | In-memory and on-disk;
                    | parallelization); Batch  | Interactive              | Batch, Interactive,
                    |                          |                          | Streaming (near real-time)
Language written in | Java                     | Java                     | Scala
API                 | [Java, Python, Scala],   | Java [ISV/Engine/Tool    | [Scala, Java, Python],
                    | user-facing              | builder]                 | user-facing
Libraries           | None; separate tools     | None                     | [Spark Core, Spark Streaming,
                    |                          |                          | Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce           | Tez                      | Spark
-----------------|----------------------------|--------------------------|----------------------------
Installation     | Bound to Hadoop            | Bound to Hadoop          | Isn't bound to Hadoop
Ease of Use      | Difficult to program,      | Difficult to program;    | Easy to program, no need
                 | needs abstractions; no     | no interactive mode      | of abstractions;
                 | interactive mode except    | except Hive, Pig         | interactive mode
                 | Hive, Pig                  |                          |
Compatibility    | to data types and data sources is the same across all three
YARN integration | YARN application           | Ground-up YARN           | Spark is moving
                 |                            | application              | towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce       | Tez                    | Spark
------------|------------------------|------------------------|-------------------------------
Deployment  | YARN                   | YARN                   | [Standalone, YARN, SIMR,
            |                        |                        | Mesos, …]
Performance | -                      | -                      | Good performance when data
            |                        |                        | fits into memory; performance
            |                        |                        | degradation otherwise
Security    | More features and      | More features and      | Still in its infancy
            | projects               | projects               | (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
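As a minimal illustration of point 2 above, the classic MapReduce word count collapses into a few Spark transformations. This is a sketch using the Spark 1.x Scala API; the input and output paths are hypothetical:

```
// Word count, translated from a MapReduce mapper/reducer pair
// into Spark transformations (Spark 1.x Scala API).
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
      .flatMap(line => line.split("\\s+"))               // the "map" side
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                // the "reduce" side
    counts.saveAsTextFile("hdfs:///data/counts")         // hypothetical path
    sc.stop()
  }
}
```

The map/reduce logic stays recognizable, but the job definition and shuffling are handled by Spark Core instead of the MapReduce framework.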

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34
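The migration path above amounts to changing the execution-mode flag on the Pig launcher. A sketch, assuming a Pig build that includes the Spork/Spark engine and an existing script named wordcount.pig (hypothetical):

```shell
# Run the same Pig script unchanged, first on MapReduce, then on Spark.
pig -x mapreduce wordcount.pig   # today's execution engine
pig -x spark wordcount.pig       # same script on the Spark engine
```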

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.

Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout on Spark (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout on Spark (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3. Integration

(Diagram: Hadoop-ecosystem services and the open source tools that integrate with Spark, grouped by layer: storage/serving layer, data formats, data ingestion services, resource management, search, SQL.)

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase, without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45
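A minimal sketch of the newAPIHadoopRDD route mentioned above, along the lines of HBaseTest.scala. The table name is hypothetical; this assumes a Spark 1.x / HBase 0.98-era classpath:

```
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Read an HBase table as an RDD via the Hadoop InputFormat API.
val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table name

val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"Rows: ${hBaseRDD.count()}")
```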

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46
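The connector's RDD round trip looks roughly like this. A sketch assuming the spark-cassandra-connector 1.x API; the keyspace, table and column names are hypothetical:

```
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraRoundTrip")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as an RDD (keyspace "ks", table "words" are hypothetical).
val rdd = sc.cassandraTable("ks", "words")
println(rdd.first())

// Write an RDD back to Cassandra.
val counts = sc.parallelize(Seq(("spark", 30), ("hadoop", 40)))
counts.saveToCassandra("ks", "words", SomeColumns("word", "count"))
```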

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: an introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a native Spark MongoDB connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving, and some open issues are critical ones (see the open SPARK issues mentioning YARN in the Apache JIRA: https://issues.apache.org/jira).

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables.

• Run SQL queries over imported data.

• Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
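A sketch of the Hive support described above, using the Spark 1.2-era HiveContext API; the table names are hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveFromSpark"))
val hiveCtx = new HiveContext(sc)

// Run HiveQL against an existing Hive table ("logs" is hypothetical).
val errors = hiveCtx.sql("SELECT msg FROM logs WHERE level = 'ERROR'")
errors.take(5).foreach(println)

// Write results back out to a Hive table.
hiveCtx.sql("CREATE TABLE IF NOT EXISTS error_msgs (msg STRING)")
errors.registerTempTable("errs")
hiveCtx.sql("INSERT INTO TABLE error_msgs SELECT msg FROM errs")
```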

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
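The native integration above boils down to a few lines with KafkaUtils. A sketch using the Spark Streaming 1.x receiver-based API; the topic name and ZooKeeper address are hypothetical:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> #threads).
val lines = KafkaUtils.createStream(ssc,
  "zk-host:2181", "demo-group", Map("events" -> 1)).map(_._2)

val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```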

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach.

• Approach 2 (experimental): pull-based approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
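The schema-inference flow above, as a sketch against the Spark 1.2-era SQLContext API; the file path is hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonDemo"))
val sqlCtx = new SQLContext(sc)

// Schema is inferred automatically; no DDL required ("people.json" is hypothetical).
val people = sqlCtx.jsonFile("people.json")
people.printSchema()

people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```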

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files.

• Run SQL queries over imported data.

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
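A sketch of the Parquet round trip, again on the Spark 1.2-era API; file paths are hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetDemo"))
val sqlCtx = new SQLContext(sc)

// Write a SchemaRDD out as Parquet, then read it back and query it.
val people = sqlCtx.jsonFile("people.json")       // any SchemaRDD source
people.saveAsParquetFile("people.parquet")        // hypothetical output path

val loaded = sqlCtx.parquetFile("people.parquet")
loaded.registerTempTable("parquet_people")
sqlCtx.sql("SELECT COUNT(*) FROM parquet_people").collect().foreach(println)
```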

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice

• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58
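With the spark-avro library above, reading Avro into Spark SQL is nearly a one-liner. A sketch assuming the early (Spark 1.2-era) spark-avro API; the file path is hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._   // adds avroFile to SQLContext

val sc = new SparkContext(new SparkConf().setAppName("AvroDemo"))
val sqlCtx = new SQLContext(sc)

// Load an Avro file as a SchemaRDD and query it with SQL.
val episodes = sqlCtx.avroFile("episodes.avro")   // hypothetical path
episodes.registerTempTable("episodes")
sqlCtx.sql("SELECT title FROM episodes LIMIT 5").collect().foreach(println)
```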

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
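Saving an RDD to Elasticsearch with elasticsearch-hadoop looks roughly like this. A sketch assuming the es-hadoop 2.1-era Spark API; the node address and index/type names are hypothetical:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD

val conf = new SparkConf()
  .setAppName("EsDemo")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Any RDD whose elements can be turned into documents can be indexed.
val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark", "stars" -> 5),
  Map("title" -> "Hadoop", "stars" -> 4)))
docs.saveToEs("demo/books")        // hypothetical index/type

// Read back as an RDD of (id, document) pairs.
val fromEs = sc.esRDD("demo/books")
println(fromEs.count())
```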

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.

• Update and delete existing documents in Solr at scale.

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

(Diagram: Hadoop ecosystem and Spark ecosystem side by side.)

4. Complementarity: Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. Mesos: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is already a healthy dose of Hadoop-ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
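The deployment choice above surfaces as the --master URL passed to spark-submit. A sketch for Spark 1.x; the cluster host names and the application class/jar are hypothetical:

```shell
# Same application jar, different cluster managers (Spark 1.x).
spark-submit --master local[4]                 --class demo.App app.jar   # local, 4 cores
spark-submit --master spark://master-host:7077 --class demo.App app.jar   # standalone
spark-submit --master mesos://mesos-host:5050  --class demo.App app.jar   # Mesos
spark-submit --master yarn-cluster             --class demo.App app.jar   # YARN
```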

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: ultra-fast data analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

            | Hadoop ecosystem | Spark ecosystem
------------|------------------|-------------------------
Components  | HDFS             | Tachyon
            | YARN             | Mesos
Tools       | Pig              | Spark native API
            | Hive             | Spark SQL
            | Mahout           | MLlib
            | Storm            | Spark Streaming
            | Giraph           | GraphX
            | HUE              | Spark Notebook / ISpark

87

Tachyon

• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share the datacenter between multiple cluster-computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria         | YARN                    | Mesos
-----------------|-------------------------|---------------------------
Resource sharing | Yes                     | Yes
Written in       | Java                    | C++
Scheduling       | Memory only             | CPU and memory
Running tasks    | Unix processes          | Linux container groups
Requests         | Specific requests and   | More generic, but more
                 | locality preference     | coding for writing
                 |                         | frameworks
Maturity         | Less mature             | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
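To give a feel for the native API's functional style without requiring a Spark installation, here is a pure-Python mock of RDD-like chained transformations (`MockRDD` is a hypothetical illustration class, not part of Apache Spark):

```python
# Minimal mock of Spark's RDD-style chained transformations.
# MockRDD is a hypothetical illustration class, NOT an Apache Spark API.
class MockRDD:
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        # one input element may produce many output elements
        return MockRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MockRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        # combine all values sharing the same key
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MockRDD(acc.items())

    def collect(self):
        return list(self.data)

# The classic word count, in the same shape as Spark's Scala/Python API:
lines = MockRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines.flat_map(str.split)
               .map(lambda w: (w, 1))
               .reduce_by_key(lambda a, b: a + b)
               .collect())
```

The real RDD API has the same chained shape (`flatMap`, `map`, `reduceByKey`, `collect`), which is why the interactive shells are productive for exploration.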

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python

95
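The processing-model difference between Storm and Spark Streaming is easy to sketch in plain Python: a record-at-a-time engine handles each event as it arrives, while a micro-batch engine groups events into small batches before processing. This is a conceptual sketch only, not either engine's implementation:

```python
# Conceptual sketch: record-at-a-time vs. micro-batch processing.
# Neither function is Storm or Spark Streaming code; both are illustrations.

def record_at_a_time(events, handle):
    """Process each event the moment it 'arrives' (Storm's model)."""
    for e in events:
        handle(e)

def micro_batch(events, handle_batch, batch_size):
    """Group events into small batches, then process each batch
    (the shape of Spark Streaming's discretized-stream model)."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        handle_batch(batch)

events = list(range(7))
seen_single, seen_batches = [], []
record_at_a_time(events, seen_single.append)
micro_batch(events, seen_batches.append, batch_size=3)
```

Batching is what trades the few seconds of latency in the table for exactly-once semantics and reuse of the core batch API.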

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 26: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

1. Evolution

• 'Spark' for lightning-fast speed.

• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org

• Apache Spark is a general-purpose cluster computing framework: its execution model supports a wide variety of use cases: batch, interactive, near-real time.

• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.

26

1. Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."

• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs. Tez vs. Spark

Criteria            | MapReduce                                   | Tez                                 | Spark
License             | Open Source Apache 2.0, version 2.x         | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive         | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                | Scala
API                 | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]      | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                        | None                                | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                       | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                 | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig  | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                            | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                                | Ground-up YARN application                                 | Spark is moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                                          | Spark
Deployment  | YARN                       | YARN                                         | [Standalone, YARN, SIMR, Mesos, ...]
Performance | -                          | -                                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects (partial support) | Still in its infancy

30

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

31

2. Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
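Point 1 above can be sketched in plain Python: the same mapper and reducer functions drive both a MapReduce-style shuffle and a Spark-style flatMap/reduceByKey pipeline. This is a pure-Python simulation that assumes nothing about either framework's actual runtime:

```python
# Pure-Python simulation: reusing MapReduce mapper/reducer logic in a
# Spark-style pipeline. No Hadoop or Spark APIs are used here.
from itertools import groupby

def mapper(line):                      # classic word-count mapper
    return [(w, 1) for w in line.split()]

def reducer(key, values):              # classic word-count reducer
    return (key, sum(values))

lines = ["spark with hadoop", "spark without hadoop"]

# MapReduce style: map, shuffle/sort by key, then reduce per key.
shuffled = sorted(kv for line in lines for kv in mapper(line))
mr_result = dict(reducer(k, [v for _, v in grp])
                 for k, grp in groupby(shuffled, key=lambda kv: kv[0]))

# Spark style: the same functions, expressed as chained transformations
# (conceptually flatMap(mapper) followed by reduceByKey(+)).
pairs = [kv for line in lines for kv in mapper(line)]
spark_result = {}
for k, v in pairs:
    spark_result[k] = spark_result.get(k, 0) + v
```

Both pipelines produce identical counts, which is the point: the business logic in the mapper and reducer carries over unchanged.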

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration, without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34
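A sketch of what the migration looks like from the command line, assuming a Spork-enabled Pig build (the script name is illustrative):

```shell
# Run an existing Pig script unchanged, but with Spark as the engine.
# wordcount.pig is a hypothetical script name.
pig -x spark wordcount.pig

# Compare with the traditional MapReduce execution mode:
pig -x mapreduce wordcount.pig
```

The script itself does not change; only the execution mode flag does, which is what "without development effort" means here.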

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence-based recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

42

3. Integration

Hadoop ecosystem services and the open source tools that integrate with Spark, by layer:

• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache. https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: code examples and state of the game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at the JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
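To illustrate the idea behind automatic schema inference (a conceptual pure-Python sketch, not Spark SQL's actual algorithm): scan the JSON records once, record each field's name and observed type, and widen to a common type when records disagree.

```python
# Conceptual sketch of JSON schema inference (NOT Spark SQL's implementation):
# scan records, collect field -> type, and widen conflicting types.
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                schema[field] = "string"   # crude widening rule for the sketch
    return schema

data = [
    '{"name": "spark", "year": 2014}',
    '{"name": "hadoop", "year": "2006?"}',   # conflicting type for "year"
]
schema = infer_schema(data)
```

The single scan is what lets the engine answer SQL queries over raw JSON files with no DDL step.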

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
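Why columnar? A pure-Python sketch of row vs. column layout (illustrative only; nothing here is Parquet's actual format): a columnar layout lets a query touch just the columns it needs, instead of deserializing every full row.

```python
# Illustrative sketch: row-oriented vs. column-oriented storage access.
rows = [
    {"name": "spark",  "year": 2014, "lang": "scala"},
    {"name": "hadoop", "year": 2006, "lang": "java"},
    {"name": "flink",  "year": 2014, "lang": "java"},
]

# Column-oriented: one contiguous list per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# "SELECT name WHERE year = 2014" touches only two of the three columns.
names_2014 = [n for n, y in zip(columns["name"], columns["year"]) if y == 2014]
```

Skipping unread columns (plus per-column compression) is the core reason columnar formats pay off for analytical queries.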

3. Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. Mesos: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
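The "Data << RAM" point is essentially the win from caching parsed data instead of re-parsing it on every pass. A plain-Python sketch of that idea (illustrative only; this is not Spark's cache implementation):

```python
# Illustrative sketch: why caching parsed data across passes helps when
# the dataset fits in memory. Not Spark code; just the access pattern.
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1           # count how often we pay the parsing cost
    return int(line)

raw = ["1", "2", "3", "4"]

# Without caching: every pass over the data re-parses it (MapReduce-style,
# where intermediate data goes back to disk between jobs).
total = sum(parse(l) for l in raw)
maximum = max(parse(l) for l in raw)
uncached_calls = parse_calls

# With caching: parse once, keep the parsed form in memory
# (conceptually what rdd.cache() buys an iterative Spark job).
parse_calls = 0
cached = [parse(l) for l in raw]
total2, maximum2 = sum(cached), max(cached)
cached_calls = parse_calls
```

Two passes without the cache cost twice the parsing work; iterative algorithms with dozens of passes amplify the difference accordingly.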

4. Complementarity

• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

73

1. File System

Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• ...

75

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
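In practice, the deployment choice largely comes down to the `--master` URL passed to `spark-submit`. A sketch (host names and the application jar are illustrative):

```shell
# Same application, different cluster managers (host names are illustrative).
spark-submit --master "local[4]"          my-app.jar   # local, 4 cores
spark-submit --master spark://host:7077   my-app.jar   # standalone cluster
spark-submit --master mesos://host:5050   my-app.jar   # Apache Mesos
spark-submit --master yarn-cluster        my-app.jar   # Hadoop YARN
```

The application code does not change between these; only the cluster-manager URL does, which is what makes Spark infrastructure-agnostic.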

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra-Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem component → Spark ecosystem component

HDFS   → Tachyon
YARN   → Mesos

Hadoop ecosystem tool → Spark ecosystem tool

Pig    → Spark native API
Hive   → Spark SQL
Mahout → MLlib
Storm  → Spark Streaming
Giraph → GraphX
HUE    → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria         | YARN                                       | Mesos
-----------------|--------------------------------------------|------------------------------------------------------
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and Memory
Running tasks    | Unix processes                             | Linux Container groups
Requests         | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity         | Less mature                                | Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
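As a hedged illustration of how concise the native API is, here is a minimal word count as it could be typed into the Scala spark-shell (the input path is a placeholder; `sc` is the SparkContext the shell provides):

```scala
// Run inside spark-shell, which provides the SparkContext as `sc`.
val lines = sc.textFile("hdfs:///tmp/input.txt")  // placeholder path

val counts = lines
  .flatMap(line => line.split("\\s+"))            // split each line into words
  .map(word => (word, 1))                         // pair each word with a count of 1
  .reduceByKey(_ + _)                             // sum the counts per word

counts.take(10).foreach(println)                  // inspect a few results
```

The same pipeline is only a few lines in Python as well, which is the point the slide makes about Spark's APIs versus raw MapReduce.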

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
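A hedged sketch of the "mix and match" idea, using the Spark 1.2-era SQLContext API in spark-shell (the case class, names and ages are made up for illustration):

```scala
// spark-shell sketch (Spark 1.2-era API); data and names are illustrative.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._  // brings in the implicit RDD -> SchemaRDD conversion

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 19)))
people.registerTempTable("people")

// Declarative SQL followed by imperative RDD operations on the result:
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.map(row => row.getString(0).toUpperCase).collect()
```

The result of a SQL query is itself an RDD, so ordinary transformations (or MLlib) can follow the SQL step directly.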

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                                  | Storm                              | Spark Streaming
------------------------------------------|------------------------------------|----------------------
Processing model                          | Record at a time                   | Mini batches
Latency                                   | Sub-second                         | Few seconds
Fault tolerance – every record processed  | At least once (may be duplicates)  | Exactly once
Batch framework integration               | Not available                      | Core Spark API
Supported languages                       | Any programming language           | Scala, Java, Python

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment!

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 27: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

1 Evolution: Apache Flink

• Flink: German for "nimble, swift, speedy".

• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".

• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer

• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink

27

Hadoop MapReduce vs Tez vs Spark

Criteria            | Hadoop MapReduce                             | Tez                                  | Spark
--------------------|----------------------------------------------|--------------------------------------|-----------------------------------------------------------------
License             | Open Source Apache 2.0, version 2.x          | Open Source Apache 2.0, version 0.x  | Open Source Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch  | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                         | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing           | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                         | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                      | Tez                                                      | Spark
-----------------|-----------------------------------------------------------------------|----------------------------------------------------------|----------------------------------------------------------
Installation     | Bound to Hadoop                                                       | Bound to Hadoop                                          | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is same                               | to data types and data sources is same                   | to data types and data sources is same
YARN integration | YARN application                                                      | Ground-up YARN application                               | Spark is moving towards YARN

29

Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
------------|----------------------------|----------------------------|---------------------------------------------------------------------------
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance | –                          | –                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy

30

(Partial support)

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32

2 Transition

3 The following tools originally based on Hadoop

MapReduce are being ported to Apache Spark

bull Pig Hive Sqoop Cascading Crunch Mahout hellip

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (Expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (Expected in Mahout 1.0)

• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Diagram: services and the open source tools Spark integrates with, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45
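A hedged sketch of the newAPIHadoopRDD route, modeled on Spark's HBaseTest.scala example; the table name is a placeholder, and a reachable HBase cluster is assumed:

```scala
// Sketch, modeled on Spark's HBaseTest.scala example; the table name
// and a running HBase cluster are assumptions for illustration.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table name

// Each record is a (row key, Result) pair read through the Hadoop API:
val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hBaseRDD.count())
```

No HBase-specific Spark code is needed: the generic Hadoop InputFormat machinery does the work.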

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46
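A hedged sketch of the Spark Cassandra Connector API (1.x era); the keyspace, table and column names are made up, and a running Cassandra cluster plus the connector on the classpath are assumed:

```scala
// Sketch using the DataStax Spark Cassandra Connector (1.x-era API);
// keyspace, table and column names are illustrative assumptions.
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD and filter it in Spark:
val users  = sc.cassandraTable("my_keyspace", "users")
val adults = users.filter(row => row.getInt("age") >= 21)

// Write an RDD back to another Cassandra table:
adults.map(row => (row.getString("name"), row.getInt("age")))
      .saveToCassandra("my_keyspace", "adults", SomeColumns("name", "age"))
```

This is the "expose tables as RDDs / write RDDs to tables" pattern the bullet above describes, with no Hadoop layer involved.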

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.

• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
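A hedged sketch of querying an existing Hive table from Spark SQL (Spark 1.2-era HiveContext); the table and column names are assumptions, and a configured Hive metastore is required:

```scala
// Sketch of Hive access from Spark SQL (Spark 1.2-era API);
// "web_logs" and its columns are hypothetical, and a Hive
// metastore must be reachable from the Spark application.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Run HiveQL over an existing Hive table:
val topPages = hiveContext.sql(
  "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10")

// The result is an RDD, so MLlib or plain RDD transformations can follow:
topPages.collect().foreach(println)
```

This is the path the slide describes for feeding Hive-managed datasets into MLlib pipelines.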

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
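A hedged sketch of the receiver-based Kafka integration (Spark 1.2-era API); the ZooKeeper address, consumer group and topic name are placeholders, and a running Kafka/ZooKeeper setup is assumed:

```scala
// Sketch of the receiver-based Kafka integration (Spark 1.2-era API);
// ZooKeeper address, consumer group and topic name are placeholders.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// Consume the hypothetical "events" topic with one receiver thread:
val messages = KafkaUtils.createStream(
  ssc, "zk-host:2181", "demo-consumer-group", Map("events" -> 1))

// Count the messages arriving in each 10-second batch:
messages.count().print()

ssc.start()
ssc.awaitTermination()
```

The integration guide linked above also covers the newer direct (receiver-less) approach as it becomes available.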

3 Integration

• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
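A hedged sketch of the schema-inference workflow (Spark 1.2-era API); the file path and field names are assumptions:

```scala
// Sketch of Spark SQL's JSON support (Spark 1.2-era API);
// the path and the "name"/"age" fields are illustrative.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// The schema is inferred automatically from the JSON records -- no DDL:
val people = sqlContext.jsonFile("hdfs:///tmp/people.json")
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()
```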

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
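A hedged sketch covering all three bullets above (Spark 1.2-era API); the paths, table name and column are assumptions:

```scala
// Sketch of reading and writing Parquet with Spark SQL (Spark 1.2-era API);
// paths, the "events" table and the "level" column are illustrative.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

// Load a Parquet file; the schema travels inside the file itself:
val events = sqlContext.parquetFile("hdfs:///tmp/events.parquet")
events.registerTempTable("events")

val errors = sqlContext.sql("SELECT * FROM events WHERE level = 'ERROR'")

// Write the filtered result back out as Parquet:
errors.saveAsParquetFile("hdfs:///tmp/errors.parquet")
```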

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity: YARN + Mesos

References:

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
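As a hedged sketch of point 4, targeting S3 instead of HDFS only changes the URI scheme and credentials; the bucket name and keys below are placeholders:

```scala
// Sketch of pointing Spark at a non-HDFS store -- here Amazon S3 via the
// s3n:// scheme of the era; bucket name and credentials are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Same RDD API as with HDFS; only the URI scheme changes:
val logs = sc.textFile("s3n://my-bucket/logs/*.log")
println(logs.count())
```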

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
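In practice, "agnostic to the underlying infrastructure" means the deployment option above shows up only in the master URL passed to Spark (e.g. via `spark-submit --master`); the application code is unchanged. The sketch below is an illustration only - the master URL formats follow Spark's documentation, but the host names, ports, and helper function are placeholders, not a real cluster setup:

```python
# Illustrative mapping from deployment option to the Spark master URL
# passed to spark-submit --master (hosts and ports are placeholders).
MASTER_URLS = {
    "local": "local[*]",                       # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos": "mesos://mesos-host:5050",        # Apache Mesos
    "yarn": "yarn-cluster",                    # Hadoop YARN (Spark 1.x syntax)
}

def submit_command(deployment: str, app_jar: str) -> str:
    """Build an illustrative spark-submit command line."""
    return f"spark-submit --master {MASTER_URLS[deployment]} {app_jar}"

print(submit_command("standalone", "my-app.jar"))
```

The point of the sketch: switching from, say, standalone mode to Mesos changes one string, not the job itself.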

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem | Spark ecosystem

Components:
HDFS | Tachyon
YARN | Mesos

Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
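The "concise, lambda-chaining" style the slide describes can be illustrated without Spark at all. The sketch below is a pure-Python stand-in for the classic RDD word count (plain lists play the role of RDDs; this is the style of the API, not the Spark API itself):

```python
# Pure-Python stand-in for the Spark word-count pipeline, written in the
# functional flatMap -> map -> reduceByKey style the Scala/Python/Java 8
# APIs encourage. No Spark required.
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

words = [w for line in lines for w in line.split()]           # flatMap
pairs = list(map(lambda w: (w, 1), words))                    # map
counts = reduce(                                              # reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In actual Spark code the same pipeline is a chain of `flatMap`, `map`, and `reduceByKey` calls on an RDD, which is why Java 8 lambdas bring the Java API close to the Scala one.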

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
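The "mix and match SQL and imperative APIs" point can be sketched with Python's stdlib `sqlite3` as a stand-in engine (Spark SQL itself is not used here - only the unified pattern is illustrated: register structured data, query it declaratively, then post-process the result in ordinary code):

```python
import sqlite3

# Stand-in for the Spark SQL usage pattern, using an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 3), ("bob", 5), ("ann", 2)],
)

# Declarative step: SQL aggregation.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary code over the query result.
top_users = {user: total for user, total in rows if total > 4}
print(top_users)  # {'ann': 5, 'bob': 5}
```

In Spark SQL the declarative step runs over RDDs/SchemaRDDs instead of SQLite tables, and the imperative step is any Scala/Java/Python transformation over the result.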

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python

95
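The "record at a time" vs. "mini batches" distinction in the table can be sketched in a few lines of plain Python: a Spark-Streaming-style engine slices the incoming stream into small fixed-interval batches and applies an ordinary batch computation to each one (the batch size and the counting function here are illustrative choices, not Spark APIs):

```python
# Minimal sketch of the mini-batch processing model used by Spark
# Streaming: cut the stream into fixed-size batches, then process each
# batch with a normal batch computation.
from collections import Counter

def mini_batches(stream, batch_size):
    """Yield successive fixed-size slices of the stream."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = ["a", "b", "a", "c", "b", "a"]
results = [Counter(batch) for batch in mini_batches(stream, 3)]
print(results)
```

A record-at-a-time system like Storm would instead invoke the processing logic once per element as it arrives, which is what buys its sub-second latency at the cost of batch-framework integration.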

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 28: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory, on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

28

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | To data types and data sources is same | To data types and data sources is same | To data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN

29

Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)

30

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
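The point about reusing mapper and reducer logic can be sketched without any Hadoop or Spark dependency: the per-record map function and the per-key combine function survive the migration unchanged - only the driver that wires them together differs. In this illustrative stand-in, `itertools.groupby` plays the role of the shuffle/sort phase:

```python
# Sketch: an unchanged mapper/reducer pair driven by a MapReduce-style
# pipeline, with sort + itertools.groupby standing in for the shuffle.
from itertools import groupby
from functools import reduce

def mapper(line):               # unchanged per-record logic
    return [(w, 1) for w in line.split()]

def reducer(a, b):              # unchanged per-key combine logic
    return a + b

def run_pipeline(lines):
    # "map" phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # "shuffle" phase: bring equal keys together
    pairs.sort(key=lambda kv: kv[0])
    # "reduce" phase: fold each key's values with the reducer
    return {
        key: reduce(reducer, (v for _, v in group))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    }

print(run_pipeline(["spark and hadoop", "spark or hadoop"]))
# {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

In Spark, the same `mapper` feeds a `flatMap` and the same `reducer` feeds a `reduceByKey`, which is why the translation described in the Cloudera post above is often mechanical.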

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)

• Leverage new Spark-specific operators in Pig, such as Cache

• Still leverage many existing Pig UDF libraries

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop

• Performance benefits, especially for Hive queries involving multiple reducer stages

• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho (Cloudera): http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence based recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Diagram: Hadoop ecosystem services and the open source tools integrating with Spark - storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore. Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID), native graph database.

• Getting started with Apache Spark and Neo4j using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data graph analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving; some open issues are critical ones. See the Spark JIRA (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: integrating Kafka and Spark Streaming - code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume integration guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
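What "automatically infer the schema" means can be illustrated with the stdlib `json` module alone: scan the records and collect the union of fields with their observed types. This is a deliberate simplification of what Spark SQL's JSON support does (real inference also reconciles conflicting and nested types):

```python
import json

# Toy schema inference over newline-delimited JSON records: take the
# union of all keys seen, and record each field's first observed type.
records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for line in records:
    for key, value in json.loads(line).items():
        schema.setdefault(key, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Because the schema is derived from the data itself, no DDL has to be written before querying, which is exactly the convenience the slide describes.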

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
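The benefit of a columnar layout like Parquet's can be sketched in plain Python: storing values column by column lets a query read only the columns it touches, instead of scanning every full row. This is a toy model only - real Parquet adds encodings, compression, and row groups on top of the idea:

```python
# Toy contrast between row-oriented and column-oriented storage.
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 5, "country": "FR"},
]

# Columnar layout: one list per column, as in Parquet's data pages.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A query over one column touches only that column's values...
total_clicks = sum(columns["clicks"])
# ...whereas the row layout forces us through every full record.
total_clicks_rowwise = sum(row["clicks"] for row in rows)

assert total_clicks == total_clicks_rowwise
print(total_clicks)  # 8
```

The same column pruning is why analytical engines like Spark SQL pair so naturally with Parquet files.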

3 Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice

• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity: Spark + Tachyon + HDFS

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity: YARN + Mesos

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: can't we all just get along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The challenge to choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Component    HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini batches
Latency                       Sub-second                 Few seconds
Fault tolerance –             At least once              Exactly once
every record processed        (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

75

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
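Whichever deployment is chosen, it is ultimately expressed as the master URL handed to Spark. A hypothetical helper (the `master_url` function is illustrative; the URL schemes themselves are Spark's standard ones, with `yarn-client` in the Spark 1.x style):

```python
def master_url(mode, host="localhost", port=7077):
    """Map a deployment choice to the master URL Spark expects."""
    urls = {
        "local": "local[*]",                     # run in one JVM, all cores
        "standalone": f"spark://{host}:{port}",  # Spark's own cluster manager
        "mesos": f"mesos://{host}:5050",         # Apache Mesos master
        "yarn": "yarn-client",                   # Hadoop YARN (Spark 1.x style)
    }
    return urls[mode]

print(master_url("standalone"))  # spark://localhost:7077
```

The application code is the same in every case; only this one string changes, which is what makes Spark cluster-infrastructure agnostic.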

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions
• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data (most commonly) from Amazon's S3, Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives (Hadoop ecosystem → Spark ecosystem)
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos
• Resource sharing: yes for both.
• Written in: YARN in Java; Mesos in C++.
• Scheduling: YARN schedules on memory only; Mesos on CPU and memory.
• Running tasks: YARN runs tasks as Unix processes; Mesos uses Linux container groups.
• Requests: YARN supports specific requests and locality preference; Mesos is more generic, but requires more coding for writing frameworks.
• Maturity: YARN is less mature; Mesos is relatively more mature.

90

Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
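The "mix and match SQL and imperative code" workflow looks like this in miniature, with stdlib sqlite3 standing in for Spark SQL (an analogy only, not Spark code; the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, ms INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("INFO", 10), ("ERROR", 250), ("ERROR", 90)])

# Declarative step: aggregate in SQL...
rows = conn.execute(
    "SELECT level, AVG(ms) FROM logs GROUP BY level").fetchall()

# ...imperative step: post-process the result in ordinary code, the way a
# Spark SQL result feeds further RDD/DataFrame operations.
slow = {level: avg for level, avg in rows if avg > 100}
print(slow)  # {'ERROR': 170.0}
```

In Spark SQL the result of the SQL step is itself a distributed dataset, so the imperative step runs on the cluster rather than on a local result set.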

Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

93

Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

94

Storm vs. Spark Streaming
• Processing model: Storm processes a record at a time; Spark Streaming processes mini-batches.
• Latency: Storm is sub-second; Spark Streaming takes a few seconds.
• Fault tolerance (every record processed): Storm guarantees at least once (there may be duplicates); Spark Streaming guarantees exactly once.
• Batch framework integration: not available in Storm; Spark Streaming uses the core Spark API.
• Supported languages: Storm works with any programming language; Spark Streaming supports Scala, Java and Python.

95
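The processing-model row is the key difference. Sketched over a plain Python iterator (illustrative only; a size-based batch stands in for Spark Streaming's time-based batch interval, and neither function is either system's API):

```python
def record_at_a_time(stream, handle):
    # Storm-style: hand each record to the topology as it arrives.
    for record in stream:
        handle(record)

def mini_batches(stream, batch_size):
    # Spark Streaming-style: group arrivals and run each group through
    # the batch engine; latency rises, but the batch machinery is reused.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

print(list(mini_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is also what buys Spark Streaming its exactly-once semantics: a failed batch can simply be recomputed as a unit.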

GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

96

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

97

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


Hadoop MapReduce vs. Tez vs. Spark
• Deployment: MapReduce and Tez deploy on YARN; Spark deploys on Standalone, YARN, SIMR, Mesos, ...
• Performance: Spark shows good performance when data fits into memory, with performance degradation otherwise.
• Security: MapReduce and Tez have more security features and projects; Spark security is still in its infancy (partial support).

30

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

31

2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32
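Point 1, reusing mapper and reducer functions, can be pictured in plain Python: the same two word-count functions serve both a MapReduce job and a Spark-style flatMap/reduceByKey pipeline (simulated here with itertools; all names are illustrative, not either framework's API):

```python
from functools import reduce
from itertools import groupby

def mapper(line):                     # classic MapReduce-style map function
    return [(word, 1) for word in line.split()]

def reducer(a, b):                    # classic MapReduce-style reduce function
    return a + b

def word_count(lines):
    pairs = [kv for line in lines for kv in mapper(line)]      # flatMap
    pairs.sort(key=lambda kv: kv[0])                           # shuffle/sort
    return {key: reduce(reducer, (v for _, v in group))        # reduceByKey
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

print(word_count(["spark or hadoop", "spark and hadoop"]))
# {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

The migration cost is mostly in the plumbing around the functions, not in the functions themselves.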

2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

33

Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query → logical plan → physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration
The original slide maps each Hadoop ecosystem service to the open source tools that integrate with Spark, across these layers: storage/serving layer, data formats, data ingestion services, resource management, search and SQL.

43

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration
• Out of the box, Spark can interface with HBase, as Spark has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark into Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
   • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
   • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
   • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• The integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration
• Spark SQL provides built-in support for Hive tables:
   • import relational data from Hive tables;
   • run SQL queries over imported data;
   • easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
   • use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark;
   • use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
   • Approach 1: Flume-style push-based approach.
   • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
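What "automatically infer the schema" means, in miniature plain Python (Spark SQL's real inference is much richer: nested structs, type merging, null handling, sampling):

```python
import json

def infer_schema(json_lines):
    # Take the union of fields across all records, noting each field's type.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

lines = ['{"name": "a", "age": 3}', '{"name": "b", "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Records need not share the same fields; the inferred schema is the union, which is why no DDL has to be written up front.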

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
   • import relational data from Parquet files;
   • run SQL queries over imported data;
   • easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
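The columnar idea behind Parquet, reduced to a toy row-to-column pivot in plain Python (illustrative only; real Parquet adds row groups, encodings and compression):

```python
rows = [{"user": "a", "ms": 120}, {"user": "b", "ms": 340}]

def to_columns(rows):
    # Pivot a list of uniform records into one array per field.
    return {field: [row[field] for row in rows] for field in rows[0]}

cols = to_columns(rows)
# A query over one field touches only that column's values,
# instead of scanning every whole row:
print(sum(cols["ms"]))  # 460
```

This is why columnar formats pay off for analytical queries that read a few columns out of wide tables.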

3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
   • various inbound data sets;
   • data layout can change without notice;
   • new data sets can be added without notice.
• Result:
   • leverage Spark to dynamically split the data;
   • leverage Avro to store the data in a compact binary format.

58

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
   • migrate ingestion of HDFS data into Solr from MapReduce to Spark;
   • update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. Use OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
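As a minimal sketch of the "no HDFS" point above: reading data straight from Amazon S3 with the ordinary RDD API. The bucket name and credentials are placeholders, and a Spark 1.x context is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: process S3 data with Spark, no HDFS involved.
val conf = new SparkConf().setAppName("S3WithoutHDFS")
val sc = new SparkContext(conf)

// s3n:// URIs are handled by the Hadoop S3 filesystem client, not HDFS.
// Access key, secret key, and bucket below are illustrative placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val logs = sc.textFile("s3n://my-bucket/logs/*.log")
println("lines: " + logs.count())
```

The same `textFile` call works unchanged against any of the storage back-ends listed above; only the URI scheme changes.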

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
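The cluster-agnosticism above is visible in the API: the same application code runs everywhere, and only the master URL changes per deployment mode. A minimal sketch (host names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL per deployment mode; the rest of the code is identical.
val master = "local[4]"                    // 1. Local mode: 4 worker threads
// val master = "spark://host:7077"        // 2. Standalone cluster manager
// val master = "mesos://host:5050"        // 3. Apache Mesos
// val master = "yarn-client"              //    YARN, when Hadoop is present

val conf = new SparkConf().setAppName("AgnosticDeployment").setMaster(master)
val sc = new SparkContext(conf)

println(sc.parallelize(1 to 100).sum())
```

In practice the master is usually passed via `spark-submit --master ...` rather than hard-coded, which keeps the application binary identical across all of the deployments listed above.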

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop ecosystem   Spark ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
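A minimal sketch of what "no code change" means in practice: a Spark 1.x job can read, cache, and write through Tachyon just by using `tachyon://` URIs and off-heap storage. The master host/port and paths are placeholders, and the sketch assumes an existing SparkContext `sc` with `spark.tachyonStore.url` pointed at the Tachyon master.

```scala
import org.apache.spark.storage.StorageLevel

// Read input through Tachyon instead of HDFS; only the URI scheme differs.
val rdd = sc.textFile("tachyon://master:19998/data/input.txt")

// OFF_HEAP delegates cached blocks to Tachyon, so other Spark jobs (or other
// frameworks) can share the same in-memory data across JVM boundaries.
rdd.persist(StorageLevel.OFF_HEAP)

rdd.saveAsTextFile("tachyon://master:19998/data/output")
```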

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as the data center "OS": share a datacenter between multiple cluster computing apps, and provide new abstractions and services.

• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                        Mesos
Resource sharing   Yes                         Yes
Written in         Java                        C++
Scheduling         Memory only                 CPU and Memory
Running tasks      Unix processes              Linux Container groups
Requests           Specific requests and       More generic, but more coding
                   locality preference         for writing frameworks
Maturity           Less mature                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
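To make the conciseness claim concrete, here is the classic word count in the Scala native API (the input path is a placeholder; an existing SparkContext `sc` is assumed):

```scala
// Word count with the Spark core API: three transformations, one action.
val counts = sc.textFile("input.txt")          // any supported filesystem works
  .flatMap(line => line.split(" "))            // split lines into words
  .map(word => (word, 1))                      // pair each word with a count of 1
  .reduceByKey(_ + _)                          // sum counts per word

counts.take(10).foreach(println)
```

The Java 8 lambda version is nearly line-for-line the same, which is the point of the bullet above.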

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                   Storm                      Spark Streaming
Processing model           Record at a time           Mini batches
Latency                    Sub-second                 Few seconds
Fault tolerance (every     At least once (may be      Exactly once
record processed)          duplicates)
Batch framework            Not available              Core Spark API
integration
Supported languages        Any programming language   Scala, Java, Python

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 31: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

31

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:

1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.

2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

32

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:

• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella JIRA (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Slide graphic: categories of Hadoop ecosystem services that integrate with Spark, each shown with open source tool logos: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
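A minimal sketch of the out-of-the-box route, along the lines of the HBaseTest.scala example cited above: reading an HBase table as an RDD through the Hadoop InputFormat API. The table name is a placeholder, and an existing SparkContext `sc` with HBase on the classpath is assumed.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the Hadoop InputFormat at the HBase table to scan.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder name

// newAPIHadoopRDD gives back (rowkey, row) pairs straight from HBase.
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println("rows: " + hbaseRDD.count())
```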

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
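A minimal sketch of the Spark Cassandra Connector usage described above. The keyspace, table, and rows are placeholders; the sketch assumes a SparkContext `sc` created with `spark.cassandra.connection.host` set and the connector on the classpath.

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

// Expose a Cassandra table as an RDD (keyspace "test", table "words"
// with schema (word text PRIMARY KEY, count int) are illustrative).
val rdd = sc.cassandraTable("test", "words")
println("rows: " + rdd.count())

// Any RDD of matching shape can be written back to Cassandra.
sc.parallelize(Seq(("spark", 40), ("hadoop", 78)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```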

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark into Cassandra: http://tuplejump.github.io/calliope/

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
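A minimal sketch of the Hive-table support above, using the Spark 1.2-era API. The table name is a placeholder; the sketch assumes an existing SparkContext `sc` and a hive-site.xml on the classpath pointing at the shared Hive metastore.

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads table definitions from the existing Hive metastore,
// so Hive tables become directly queryable from Spark SQL.
val hiveCtx = new HiveContext(sc)

// "src" is a placeholder table name (the canonical Hive example table).
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)
```

The resulting SchemaRDD can be handed straight to MLlib, which is the "fetching datasets for machine learning" point above.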

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, and embed Drill execution in a Spark data pipeline.

• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
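A minimal sketch of the receiver-based Kafka integration from the guide above (Spark 1.x). The ZooKeeper quorum, consumer group, and topic name are placeholders; an existing SparkContext `sc` is assumed.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Micro-batch every 2 seconds.
val ssc = new StreamingContext(sc, Seconds(2))

// Consume "my-topic" with one receiver thread; all names are placeholders.
val lines = KafkaUtils.createStream(
    ssc, "zkhost:2181", "my-consumer-group", Map("my-topic" -> 1))
  .map(_._2)   // drop the Kafka message key, keep the payload

lines.count().print()   // per-batch record counts

ssc.start()
ssc.awaitTermination()
```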

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
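A minimal sketch of the "no DDL" workflow above, using the Spark 1.2-era API (the file path, table name, and query are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Schema is inferred from the JSON data itself; no DDL needed.
val people = sqlCtx.jsonFile("people.json")
people.printSchema()

// Register the SchemaRDD and query it with plain SQL.
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```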

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
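The read/write round trip above can be sketched with the Spark 1.2-era API (file paths and table name are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Any SchemaRDD source works as input; JSON is used here for brevity.
val people = sqlCtx.jsonFile("people.json")

// Write the data out in columnar Parquet format...
people.saveAsParquetFile("people.parquet")

// ...and read it back, schema included, ready for SQL.
val parquetPeople = sqlCtx.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
```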

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL (this library requires Spark 1.2+): https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets.
    • Data layout can change without notice.
    • New data sets can be added without notice.
  • Result:
    • Leverage Spark to dynamically split the data.
    • Leverage Avro to store the data in a compact binary format.

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in elasticsearch-hadoop was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
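A minimal sketch of the elasticsearch-hadoop native integration described above. The index/type name and documents are illustrative; the sketch assumes a SparkContext `sc` created with `es.nodes` set on its SparkConf and the elasticsearch-hadoop jar on the classpath.

```scala
import org.elasticsearch.spark._   // adds saveToEs to RDDs

// Any RDD whose elements translate into documents can be indexed.
val docs = Seq(
  Map("title" -> "Spark or Hadoop", "year" -> 2015),
  Map("title" -> "Escape from Hadoop", "year" -> 2014))

// Write the documents to the "spark/docs" index/type (placeholder name).
sc.makeRDD(docs).saveToEs("spark/docs")

// Reading back works symmetrically: esRDD("spark/docs") returns an RDD
// of (document id, document) pairs.
```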

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing only one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
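As an illustration of this file-system agnosticism, switching storage back ends is mostly a matter of the URL scheme in the path passed to Spark. A minimal sketch, assuming a hypothetical S3 bucket (credentials must be configured separately):

```shell
spark-shell
scala> // Hypothetical bucket; the URL scheme selects the Hadoop FileSystem implementation
scala> val logs = sc.textFile("s3n://my-bucket/logs/2015-03-12.log")
scala> logs.count()
```

The same code reads from HDFS, MapR-FS, or the local file system simply by changing the path's scheme.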

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
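The choice of cluster manager is expressed purely through the master URL given to spark-submit; the application itself does not change. A minimal sketch (host names and the application file are placeholders, Spark 1.x syntax):

```shell
# Same application, different cluster managers; only the --master URL changes.
./bin/spark-submit --master local[4]          myapp.py   # local mode with 4 worker threads
./bin/spark-submit --master spark://host:7077 myapp.py   # Spark standalone cluster
./bin/spark-submit --master mesos://host:5050 myapp.py   # Apache Mesos
./bin/spark-submit --master yarn-cluster      myapp.py   # Hadoop YARN (Spark 1.x syntax)
```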

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

             Hadoop ecosystem    Spark ecosystem
Components   HDFS                Tachyon
             YARN                Mesos
Tools        Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                              Mesos
Resource sharing   Yes                               Yes
Written in         Java                              C++
Scheduling         Memory only                       CPU and memory
Running tasks      Unix processes                    Linux container groups
Requests           Specific requests and             More generic, but more coding
                   locality preference               for writing frameworks
Maturity           Less mature                       Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
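As a minimal sketch of that conciseness, a complete word count in the interactive Scala shell (the input path is hypothetical):

```shell
spark-shell
scala> val counts = sc.textFile("hdfs:///data/input.txt")
     |   .flatMap(line => line.split(" "))
     |   .map(word => (word, 1))
     |   .reduceByKey(_ + _)
scala> counts.take(5)
```

The equivalent Hadoop MapReduce job takes tens of lines of boilerplate across a mapper class, a reducer class, and a driver.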

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
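A sketch of that mix-and-match style in the shell, assuming a hypothetical Hive table named events is reachable through the metastore:

```shell
spark-shell
scala> val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val top = hiveCtx.sql("SELECT user, COUNT(*) AS n FROM events GROUP BY user")
scala> // The result is a SchemaRDD, so ordinary RDD operations can follow the SQL step
scala> top.map(row => (row.getString(0), row.getLong(1))).take(10)
```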

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                        Storm                     Spark Streaming
Processing model                Record at a time          Mini batches
Latency                         Sub-second                Few seconds
Fault tolerance (every          At least once (may        Exactly once
record processed)               have duplicates)
Batch framework integration     Not available             Core Spark API
Supported languages             Any programming           Scala, Java, Python
                                language

95
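The processing-model row above can be illustrated with a tiny plain-Python sketch (this is neither Storm nor Spark code, just the two shapes of stream handling): the same event stream is either handed to a handler one record at a time, or grouped into small batches first.

```python
def process_per_record(events, handler):
    # Storm-style: each record is passed to the handler as it arrives.
    return [handler(e) for e in events]

def process_mini_batches(events, batch_size, handler):
    # Spark Streaming-style: records are grouped into small batches,
    # and the handler sees one whole batch at a time.
    batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
    return [handler(b) for b in batches]

events = [1, 2, 3, 4, 5]
print(process_per_record(events, lambda e: e * 10))  # -> [10, 20, 30, 40, 50]
print(process_mini_batches(events, 2, sum))          # -> [3, 7, 5]
```

Per-record handling minimizes latency; batching amortizes scheduling overhead and lets the batch engine (Core Spark) be reused, which is the trade-off summarized in the table.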

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

IV More QampA

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 32: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

2 Transition

• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark

32
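Point 1 above can be sketched in plain Python (no Spark needed to run it): a MapReduce-style mapper and reducer are reused unchanged inside a flatMap/reduceByKey-shaped pipeline, which is essentially what calling them from Spark looks like. The helper names here are illustrative, not a real Spark API.

```python
from collections import OrderedDict

def mapper(line):
    # MapReduce-style mapper: emit (word, 1) pairs for one input record.
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    # MapReduce-style reducer logic: combine two counts for the same key.
    return a + b

def flat_map(f, records):
    # Spark's flatMap: apply f to each record and flatten the results.
    return [pair for r in records for pair in f(r)]

def reduce_by_key(f, pairs):
    # Spark's reduceByKey: fold all values for a key with f.
    acc = OrderedDict()
    for k, v in pairs:
        acc[k] = v if k not in acc else f(acc[k], v)
    return dict(acc)

lines = ["spark and hadoop", "spark or hadoop"]
print(reduce_by_key(reducer, flat_map(mapper, lines)))
# -> {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```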

2 Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

33

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34
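Based on the "-x spark" option above, the invocation difference is a single flag (the script name is a placeholder):

```shell
pig -x mapreduce myscript.pig   # classic MapReduce execution engine
pig -x spark     myscript.pig   # same script, Spark execution engine
```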

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35
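The engine switch quoted above is a per-session Hive setting; a short sketch with a hypothetical table:

```shell
hive> set hive.execution.engine=spark;    -- switch this session to the Spark engine
hive> SELECT COUNT(*) FROM my_table;      -- the query text itself is unchanged
```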

Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Diagram: open source tools integrating with Spark, grouped by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45
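A sketch along the lines of the HBaseTest.scala example above (the table name is hypothetical, and the HBase jars must be on the classpath):

```shell
spark-shell
scala> import org.apache.hadoop.hbase.HBaseConfiguration
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
scala> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
scala> import org.apache.hadoop.hbase.client.Result
scala> val conf = HBaseConfiguration.create()
scala> conf.set(TableInputFormat.INPUT_TABLE, "my_table")
scala> val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
     |   classOf[ImmutableBytesWritable], classOf[Result])
scala> rdd.count()
```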

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46
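With the Spark Cassandra Connector on the classpath, reading and writing look like this sketch (keyspace, table, and column names are hypothetical):

```shell
spark-shell
scala> import com.datastax.spark.connector._
scala> val rdd = sc.cassandraTable("my_keyspace", "my_table")   // table as an RDD
scala> rdd.first()
scala> sc.parallelize(Seq((1, "a")))
     |   .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))
```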

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark-on-YARN issues in JIRA: https://issues.apache.org/jira/issues/?jql=project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
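A sketch of the receiver-based integration from the guide above (the ZooKeeper host, consumer group, and topic are hypothetical, and the streaming-kafka artifact must be on the classpath):

```shell
spark-shell
scala> import org.apache.spark.streaming._
scala> import org.apache.spark.streaming.kafka._
scala> val ssc = new StreamingContext(sc, Seconds(2))   // 2-second mini batches
scala> val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
     |   .map(_._2)   // drop the message key, keep the payload
scala> lines.print()
scala> ssc.start()
```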

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
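A sketch of that schema-inference workflow (the file path is hypothetical; Spark 1.2-era API where jsonFile returns a SchemaRDD):

```shell
spark-shell
scala> val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
scala> val people = sqlCtx.jsonFile("hdfs:///data/people.json")
scala> people.printSchema()                   // schema inferred, no DDL written
scala> people.registerTempTable("people")
scala> sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect()
```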

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
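A round-trip sketch with the Spark 1.2-era API (paths and names are hypothetical; the createSchemaRDD implicit converts a case-class RDD for SQL use):

```shell
spark-shell
scala> val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlCtx.createSchemaRDD
scala> case class Person(name: String, age: Int)
scala> val people = sc.parallelize(Seq(Person("Ann", 30)))
scala> people.saveAsParquetFile("hdfs:///data/people.parquet")   // write columnar
scala> val back = sqlCtx.parquetFile("hdfs:///data/people.parquet")
scala> back.registerTempTable("people")                          // schema preserved
```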

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4. Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: Mesos + YARN (references)

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
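To make the "bring your own storage" point concrete, here is a minimal sketch of a Spark job that reads from and writes to Amazon S3 with no HDFS cluster involved. The bucket and paths are hypothetical, and the s3n:// credentials are assumed to be set in the Hadoop configuration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3WordCount"))
    // s3n:// paths are resolved by the Hadoop S3 client library; no HDFS involved
    val lines = sc.textFile("s3n://my-bucket/logs/*.log") // hypothetical bucket
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)
    counts.saveAsTextFile("s3n://my-bucket/out/wordcounts")
    sc.stop()
  }
}
```

Only the URI scheme changes when you swap storage systems; the rest of the job is untouched.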

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
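From the application's point of view, the deployment choices above differ mainly in the --master URL passed to spark-submit. A sketch, where the host names and application jar are placeholders:

```shell
# Same application jar, different cluster manager per '--master' URL
spark-submit --master local[4]           app.jar   # local mode, 4 threads
spark-submit --master spark://host:7077  app.jar   # Spark standalone cluster
spark-submit --master mesos://host:5050  app.jar   # Apache Mesos
spark-submit --master yarn-cluster       app.jar   # YARN: the only option that needs Hadoop
```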

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

3. Distributions

• Using Spark on a non-Hadoop distribution:

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

Guavus

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

             Hadoop ecosystem    Spark ecosystem
Components:
             HDFS                Tachyon
             YARN                Mesos
Tools:
             Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

Tachyon

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs. Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 33: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

2. Transition

3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...

Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
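As a sketch of that migration path: the same unmodified Pig script is switched between engines by the -x flag alone (the script name is a placeholder, assuming a Pig build that includes the Spork work):

```shell
pig -x mapreduce wordcount.pig   # today: classic MapReduce execution
pig -x spark     wordcount.pig   # same script on Spark, no code changes
```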

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
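A sketch of that session-level switch: everything except the set line is ordinary, unchanged HiveQL (the table and query are hypothetical):

```sql
-- choose the execution engine per session: mr (default), tez, or spark
set hive.execution.engine=spark;

-- the query itself needs no modification to run on Spark
SELECT page, COUNT(*) AS hits
FROM weblogs
GROUP BY page;
```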

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

Mahout (expected in Mahout 1.0)

• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

Mahout (expected in Mahout 1.0)

• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

3. Integration

[Slide graphic: categories of Hadoop ecosystem services and the open source tools Spark integrates with: storage/serving layer, data formats, data ingestion services, resource management, search, SQL]

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
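In the spirit of the HBaseTest.scala example cited above, a minimal sketch of reading an HBase table through the standard Hadoop InputFormat API; the table name is hypothetical, and the HBase client jars are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRowCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRowCount"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "events") // hypothetical table
    // Expose the HBase table as an RDD via the generic Hadoop InputFormat support
    val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"rows: ${rows.count()}")
    sc.stop()
  }
}
```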

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
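A minimal sketch of the Spark Cassandra Connector usage described above; the contact point, keyspace, and table names are placeholder assumptions:

```scala
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraRoundTrip")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder node
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as a Spark RDD (keyspace/table hypothetical)
    val users = sc.cassandraTable("test_ks", "users")
    println(s"users: ${users.count()}")

    // Write an RDD back to a Cassandra table, mapping tuple fields to columns
    sc.parallelize(Seq(("anna", 31), ("bob", 42)))
      .saveToCassandra("test_ks", "ages", SomeColumns("name", "age"))
    sc.stop()
  }
}
```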

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

3. Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
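A sketch of those three bullets in Spark 1.2-era code, where HiveContext queries return SchemaRDDs; the table name and query are hypothetical, and a Hive metastore is assumed to be reachable from the Spark classpath:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveFromSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveFromSpark"))
    // HiveContext reuses the existing Hive metastore, data formats, and UDFs
    val hiveCtx = new HiveContext(sc)
    // Query an existing Hive table; the result is a SchemaRDD usable as any RDD
    val top = hiveCtx.sql(
      "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10")
    top.collect().foreach(println)
    sc.stop()
  }
}
```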

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
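A minimal sketch of the native integration, using the receiver-based KafkaUtils.createStream API from the integration guide; the ZooKeeper address, consumer group, topic name, and batch interval are placeholder assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaWordCount"), Seconds(2)) // 2s mini batches
    // Receiver-based stream: ZooKeeper quorum, consumer group, topic -> #threads
    val lines = KafkaUtils
      .createStream(ssc, "zk1:2181", "demo-group", Map("events" -> 1))
      .map(_._2) // keep the message value, drop the key
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```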

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
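A sketch of the "no more DDL" workflow: point Spark SQL at a JSON file, let it infer the schema, and query. The file path and field names are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonQuery"))
    val sqlCtx = new SQLContext(sc)
    // Schema is inferred from the JSON records themselves; no DDL required
    val people = sqlCtx.jsonFile("hdfs:///data/people.json") // hypothetical path
    people.printSchema() // shows the inferred field names and types
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 20")
      .collect().foreach(println)
    sc.stop()
  }
}
```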

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
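A sketch of a Parquet round trip with the Spark 1.2-era APIs named in the programming guide; the file names are placeholders, and the input could be any SchemaRDD source:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
    val sqlCtx = new SQLContext(sc)
    val events = sqlCtx.jsonFile("events.json")    // any SchemaRDD source works
    events.saveAsParquetFile("events.parquet")     // written in columnar layout
    val back = sqlCtx.parquetFile("events.parquet") // schema preserved on read
    back.registerTempTable("events")
    sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
    sc.stop()
  }
}
```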

3. Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
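The Data << RAM point above can be illustrated without Spark: caching buys you the right to parse the input once and reuse the parsed form across passes. A toy sketch (the input lines and parse function are invented for illustration; this is not Spark's caching code):

```python
# Toy illustration of why caching parsed data helps when data << RAM:
# parse each line once, keep the parsed records in memory, and run
# several passes over the cached form instead of re-parsing every time.
raw_lines = ["1,apple", "2,banana", "3,cherry"]

parse_count = 0

def parse(line):
    global parse_count
    parse_count += 1
    key, value = line.split(",")
    return int(key), value

# Without caching: every pass re-parses the raw input.
pass1 = [parse(l) for l in raw_lines]
pass2 = [parse(l) for l in raw_lines]
uncached_parses = parse_count          # 6 parses for 2 passes

# With caching (conceptually what rdd.cache() achieves): parse once, reuse.
parse_count = 0
cached = [parse(l) for l in raw_lines]  # one parsing pass
keys = [k for k, _ in cached]           # pass 1 over cached data
values = [v for _, v in cached]         # pass 2 over cached data
cached_parses = parse_count             # still 3

print(uncached_parses, cached_parses)   # 6 3
```

The saving grows with the number of passes, which is why iterative workloads benefit most from in-memory caching.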

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (object store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
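The deployment modes listed above are selected through the "master" URL handed to spark-submit or SparkConf. A small sketch of the common Spark 1.x URL formats; the host names and ports below are placeholder values:

```python
# Master URL formats for several of the deployment modes listed above
# (host names and ports are placeholders, not real cluster addresses).
def master_url(mode, host="node1", port=None):
    if mode == "local":        # run locally, one worker thread per core
        return "local[*]"
    if mode == "standalone":   # connect to a standalone Spark master
        return "spark://%s:%d" % (host, port or 7077)
    if mode == "mesos":        # connect to a Mesos master
        return "mesos://%s:%d" % (host, port or 5050)
    if mode == "yarn":         # Spark 1.x style; cluster details come from Hadoop config
        return "yarn-cluster"
    raise ValueError("unknown mode: %s" % mode)

print(master_url("standalone"))  # spark://node1:7077
```

In every mode the application code stays the same; only the master URL changes, which is what makes Spark cluster-infrastructure agnostic.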

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem → Spark ecosystem alternatives:

• Components: HDFS → Tachyon; YARN → Mesos.

• Tools: Pig → Spark native API; Hive → Spark SQL; Mahout → MLlib; Storm → Spark Streaming; Giraph → GraphX; HUE → Spark Notebook / ISpark.

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

• Resource sharing: yes for both YARN and Mesos.

• Written in: YARN is written in Java; Mesos in C++.

• Scheduling: YARN schedules on memory only; Mesos on CPU and memory.

• Running tasks: YARN runs tasks as Unix processes; Mesos uses Linux container groups.

• Requests: YARN supports specific requests and locality preference; Mesos is more generic, but requires more coding for writing frameworks.

• Maturity: YARN is less mature; Mesos is relatively more mature.

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shells in Scala and Python.

• Spark supports Java 8 lambda expressions, making the code much more concise, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
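The chained-transformation style of the native API can be sketched in plain Python, mirroring the classic flatMap → map → reduceByKey word count (a toy emulation over in-memory lists, not the Spark API itself):

```python
from collections import defaultdict

# Pure-Python emulation of the Spark word count chain.
lines = ["to be or not to be"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# In PySpark the same chain would be roughly:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The same shape carries over to Scala and, with Java 8 lambdas, to Java.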

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

• Processing model: Storm processes a record at a time; Spark Streaming uses mini batches.

• Latency: Storm is sub-second; Spark Streaming takes a few seconds.

• Fault tolerance (every record processed): Storm guarantees at least once (there may be duplicates); Spark Streaming guarantees exactly once.

• Batch framework integration: not available in Storm; Spark Streaming integrates with the core Spark API.

• Supported languages: Storm supports any programming language; Spark Streaming supports Scala, Java, and Python.

95
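The processing-model contrast can be sketched: a record-at-a-time engine invokes the handler once per record as it arrives, while a mini-batch engine groups records that arrive within a batch interval and processes each group at once. A toy timing model (the arrival times are invented; this is neither system's real scheduler):

```python
# (arrival time in seconds, record) pairs for an invented event stream
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

# Record-at-a-time (Storm-like): one handler invocation per record.
record_calls = [[rec] for _, rec in events]

# Mini-batches (Spark Streaming-like): group records by a 1-second
# batch interval, then process each batch as a unit.
batch_interval = 1.0
batches = {}
for t, rec in events:
    batches.setdefault(int(t // batch_interval), []).append(rec)
mini_batch_calls = [batches[k] for k in sorted(batches)]

print(record_calls)      # [['a'], ['b'], ['c'], ['d'], ['e']]
print(mini_batch_calls)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Batching trades per-record latency (a record waits for its batch to close) for throughput and simple exactly-once batch semantics, which is the trade-off behind the latency row in the table above.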

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


Pig on Spark (Spork)

• Run Pig with the "-x spark" option for an easy migration without development effort.

• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).

• Leverage new Spark-specific operators in Pig, such as Cache.

• Still leverage many existing Pig UDF libraries.

• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059

• Fix outstanding issues and address additional Spark functionality through the community.

• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19

34

Hive on Spark (currently in beta,

expected in Hive 1.1.0)

• A new alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta,

expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark

(Expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532

37

(Expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014: goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

[Diagram: integration points by service layer, each served by an open source tool: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.]

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44
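The storage systems above are all addressed through the URI scheme in the path handed to Spark; the read call itself stays the same. A small sketch (the paths are placeholder examples, not real endpoints):

```python
# The same Spark read call targets different storage systems purely
# through the path's URI scheme (placeholder paths for illustration):
paths = [
    "hdfs://namenode:8020/data/events",   # HDFS
    "file:///tmp/events",                 # local file system
    "s3n://my-bucket/events",             # Amazon S3 (s3n scheme in Spark 1.x)
]

def scheme(path):
    # Everything before "://" selects the Hadoop FileSystem implementation.
    return path.split("://", 1)[0]

print([scheme(p) for p in paths])  # ['hdfs', 'file', 's3n']
# In PySpark each path would be read the same way: sc.textFile(path)
```

This is why adding a new storage backend usually means adding a Hadoop-API connector, not changing application code.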

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, such as the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables.

• Run SQL queries over imported data.

• Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach.

• Approach 2 (experimental): pull-based approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
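What "automatically infer the schema" means can be sketched in plain Python: scan the JSON records and collect the union of the fields and the value types observed. This is only a toy stand-in for Spark SQL's actual inference, using invented sample records:

```python
import json

# Invented JSON records; note the fields differ between records.
records = [
    '{"name": "alice", "age": 30}',
    '{"name": "bob", "city": "LA"}',
]

# Toy schema inference: union of fields across records, with the
# Python type names observed for each field.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print({f: sorted(t) for f, t in sorted(schema.items())})
# {'age': ['int'], 'city': ['str'], 'name': ['str']}
```

Spark SQL does the analogous scan over the dataset and produces a SchemaRDD whose columns are that inferred union, with missing fields as nulls.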

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files.

• Run SQL queries over imported data.

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrative example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
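The idea behind a columnar format can be sketched: store the same records column-wise, so a query touches only the columns it needs. This illustrates the layout only, not Parquet's actual encoding, and the records are invented:

```python
# Same three invented records, row-oriented vs column-oriented.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Column-oriented layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query over one column reads just that list, not whole records;
# this is why columnar formats suit analytical scans.
total_score = sum(columns["score"])
print(columns["score"], total_score)  # [10, 20, 30] 60
```

Columnar storage also compresses well, since each column holds values of a single type.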

3 Integration

• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets.

• Data layout can change without notice.

• New data sets can be added without notice.

Result:

• Leverage Spark to dynamically split the data.

• Leverage Avro to store the data in a compact binary format.

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
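The Data << RAM point above is the heart of Spark's caching advantage. A toy plain-Python sketch (no Spark involved; the parse counter is an illustrative stand-in for the expensive I/O and deserialization that Spark's `RDD.cache()` lets you skip on repeated passes):

```python
# Toy illustration: why caching parsed data in memory pays off when the
# working set fits in RAM. parse() stands in for expensive parsing work.
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3"]

# Without caching: every pass over the data re-parses it.
total = sum(parse(l) for l in raw)
maximum = max(parse(l) for l in raw)
print(parse_calls)  # 6 parses for 2 passes

# With caching (analogous to rdd.cache()): parse once, reuse in memory.
parse_calls = 0
cached = [parse(l) for l in raw]
total = sum(cached)
maximum = max(cached)
print(parse_calls)  # 3 parses, no matter how many passes follow
```

When the data does not fit in memory, this advantage shrinks, which is exactly the Data >> RAM case the slide assigns to Tez.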

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
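The "bring your own storage" point can be made concrete: Spark resolves a path's storage backend from its URI scheme, via the Hadoop FileSystem API. A toy plain-Python sketch of that dispatch idea (the table below is illustrative, not Spark's actual registry):

```python
# Toy sketch: how a URI scheme selects a storage backend. Spark's real
# resolution goes through the Hadoop FileSystem API; this only shows
# why the same job can read hdfs://, s3n://, tachyon:// or local paths.
from urllib.parse import urlparse

BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3 (native)",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def backend_for(path):
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

print(backend_for("hdfs://namenode:8020/logs/2015/03/12"))
print(backend_for("s3n://my-bucket/events.json"))
print(backend_for("events.json"))  # no scheme -> local file system
```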

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System, Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem    Spark Ecosystem

Component   HDFS                Tachyon
            YARN                Mesos

Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                           Mesos

Resource sharing  Yes                            Yes
Written in        Java                           C++
Scheduling        Memory only                    CPU and Memory
Running tasks     Unix processes                 Linux Container groups
Requests          Specific requests and          More generic, but more coding
                  locality preference            for writing frameworks
Maturity          Less mature                    Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
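To illustrate how concise the lambda-style API is, here is the canonical word count as a plain-Python analogue; the comment shows the PySpark shape it mirrors (this is not Spark code, just the same flatMap / map / reduceByKey idea):

```python
# Plain-Python analogue of the canonical Spark word count.
# In PySpark this would be roughly:
#   sc.textFile(path).flatMap(lambda l: l.split())
#                    .map(lambda w: (w, 1))
#                    .reduceByKey(lambda a, b: a + b)
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]
words = chain.from_iterable(l.split() for l in lines)  # flatMap
counts = Counter(words)                                # map + reduceByKey
print(dict(counts))
```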

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming

Processing model              Record at a time           Mini batches
Latency                       Sub-second                 Few seconds
Fault tolerance (every        At least once (may be      Exactly once
record processed)             duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python

95
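To make the table's "record at a time" vs. "mini batches" distinction concrete, a toy pure-Python sketch of micro-batching (no Spark or Storm required; the fixed interval mimics a DStream batch interval):

```python
# Toy sketch of Spark Streaming's micro-batch model: instead of handing
# each record to user code as it arrives (Storm), records are grouped
# into consecutive fixed-length intervals and each batch is processed
# with the ordinary batch API.
def micro_batches(records, batch_seconds):
    """Group (timestamp, value) records into consecutive intervals."""
    batches = {}
    for ts, value in records:
        batches.setdefault(ts // batch_seconds, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
print(micro_batches(stream, 1))  # [['a', 'b'], ['c'], ['d']]
```

The batching is what buys exactly-once semantics and core Spark API reuse, at the cost of the few seconds of latency the table lists.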

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


Hive on Spark (currently in beta, expected in Hive 1.1.0)

• New alternative to using MapReduce or Tez:

hive> set hive.execution.engine=spark;

• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.

• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.

• Performance benefits, especially for Hive queries involving multiple reducer stages.

• Hive on Spark Umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292

35

Hive on Spark (currently in beta, expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles

• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started

• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark

• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final

• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.

• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.

• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal

• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

Mahout (expected in Mahout 1.0)

• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Hadoop ecosystem services integrated with Spark (each service backed by an open source tool in the original slide):

• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.

• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767

• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• The Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

• https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)

• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example (Part 2)

• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50
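The PageRank posts above rest on a simple iterative algorithm that graph tools like GraphX implement at scale. A minimal power-iteration sketch in plain Python (the toy graph and the usual 0.85 damping factor are illustrative choices, not taken from those posts):

```python
# Minimal power-iteration PageRank on a toy 3-node graph.
# Each iteration: every node splits its rank among its out-links,
# then ranks are damped (0.85) with a uniform teleport term (0.15).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {n: 1.0 / len(links) for n in links}

for _ in range(20):
    contrib = {n: 0.0 for n in links}
    for node, outs in links.items():
        for out in outs:
            contrib[out] += ranks[node] / len(outs)
    ranks = {n: 0.15 / len(links) + 0.85 * c for n, c in contrib.items()}

print(sorted(ranks, key=ranks.get, reverse=True))  # 'c' collects the most rank
```

Node "c" wins because it is linked from both "a" and "b"; the total rank mass stays 1.0 across iterations.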

3 Integration YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)

• Some issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables

• Run SQL queries over imported data

• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
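What "automatically infer the schema" means can be sketched in a few lines of plain Python: scan sample records and union the fields and value types seen. Spark SQL's real inference is richer (nested structs, arrays, type widening); this shows only the idea:

```python
# Toy JSON schema inference: collect every field name and the set of
# Python type names observed for it across the sample records.
import json

sample = [
    '{"name": "Ann", "age": 34}',
    '{"name": "Bo", "city": "LA"}',
]

schema = {}
for line in sample:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print({f: sorted(t) for f, t in schema.items()})
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Note how fields missing from some records ("age", "city") still end up in the unioned schema, which is why no DDL is needed up front.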

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files

• Run SQL queries over imported data

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

• Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark

• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System (Intel Enterprise Edition for Lustre, IEEL; upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop Ecosystem → Spark Ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS": share the datacenter between multiple cluster computing apps, and provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and Memory
• Running tasks: Unix processes | Linux Container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
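The conciseness claim above can be felt even without a cluster. The following plain-Python sketch (no Spark involved; `word_count` is a helper invented for this illustration) mimics the flatMap → map → reduceByKey chain that the Scala, Python, and Java 8 lambda APIs all express in a few lines:

```python
from collections import Counter

def word_count(lines):
    """Plain-Python sketch of the classic RDD-style chain:
    flatMap(split) -> map(word -> (word, 1)) -> reduceByKey(+)."""
    words = (w for line in lines for w in line.split())  # flatMap
    pairs = ((w, 1) for w in words)                      # map to (word, 1)
    counts = Counter()                                   # reduceByKey, via a dict
    for w, n in pairs:
        counts[w] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In actual Spark the same pipeline is the canonical word-count example, written against an RDD instead of a Python generator; the shape of the code is the point, not the API names.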

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
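The "mix and match SQL with imperative APIs" idea can be sketched on a single machine, with Python's built-in sqlite3 module standing in for the SQL engine. This is an illustration of the programming pattern only, not Spark SQL itself, and the `events` table is invented for the example:

```python
import sqlite3

# Declarative half: define and aggregate data with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative half: post-process the SQL result in ordinary code.
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 5), ('bob', 7)]
print(top)   # ('bob', 7)
```

Spark SQL applies the same two-step style to distributed datasets: a SQL (or DataFrame) query produces a result you then keep transforming with the regular Scala/Python API.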

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria: Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python

95
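The "mini batches" row is the key difference: Spark Streaming groups incoming records into small batches and runs a normal batch job on each one, which is why its latency is a few seconds rather than sub-second. A plain-Python sketch of that grouping follows (batching by record count for simplicity; Spark Streaming actually batches by a time interval, and `micro_batches` is a helper invented for this illustration):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded record stream into mini batches.
    A record-at-a-time engine like Storm would instead handle each
    record as it arrives: lower latency, but per-record overhead."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is processed with ordinary batch logic (here: a sum).
records = range(7)
print([sum(b) for b in micro_batches(records, 3)])  # [3, 12, 6]
```

Running normal batch code per mini batch is also what makes the "Core Spark API" integration row possible: the same functions work on streams and on stored data.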

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


Hive on Spark (currently in beta; expected in Hive 1.1.0)

• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12

36

Sqoop on Spark (expected in Sqoop 2)

• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532

37

Cascading (expected in the 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Service layers and their open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

43

3 Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) is planned, to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop

48

3 Integration

• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
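Conceptually, a Kafka topic partition is an append-only log: producers append records, and consumers read from a chosen offset onward. A toy plain-Python model of that contract follows (the `MiniLog` class is invented for illustration; real Kafka adds durability, partitioning, replication, and consumer groups on top):

```python
class MiniLog:
    """Toy append-only log standing in for one Kafka topic partition."""

    def __init__(self):
        self.records = []

    def produce(self, value):
        """Append a record and return its offset in the log."""
        self.records.append(value)
        return len(self.records) - 1

    def consume(self, offset, max_records=10):
        """Read records starting at `offset`, in append order."""
        return self.records[offset:offset + max_records]

log = MiniLog()
for v in ["a", "b", "c"]:
    log.produce(v)
print(log.consume(1))  # ['b', 'c']
```

Offsets are what let a stream processor such as Spark Streaming track its read position and resume after a failure, which is central to the integration the guide above describes.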

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: a Flume-style push-based approach.
• Approach 2 (experimental): a pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
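What "automatically infer the schema" means can be sketched in a few lines of plain Python: scan every record and note the type observed for each field. The `infer_schema` helper below is invented for illustration; Spark SQL's actual inference also handles nested structures and widens conflicting types:

```python
import json

def infer_schema(json_lines):
    """Infer a flat schema (field name -> type name) by scanning
    every JSON record; fields missing from some records still appear."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = ['{"name": "ann", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

The point of the sketch is why "no more DDL" holds: the union of fields across records gives the table's columns, so the engine can answer SQL queries without a user-declared schema.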

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
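The columnar idea behind Parquet can be shown in plain Python: re-lay row records as per-field arrays, so a query that touches one column reads only that column's values. The `to_columnar` helper is invented for illustration; Parquet adds encodings, compression, and an on-disk layout on top of this basic transposition:

```python
def to_columnar(rows):
    """Transpose row-oriented records (list of dicts) into
    column-oriented storage (dict of field -> list of values)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

rows = [{"user": "ann", "clicks": 3}, {"user": "bob", "clicks": 5}]
cols = to_columnar(rows)
print(cols["clicks"])  # [3, 5] -- only this column is touched
```

This is why columnar formats pay off for analytical queries such as `SELECT SUM(clicks)`: the engine scans one contiguous column instead of deserializing every full row.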

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
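The Data >> RAM vs. Data << RAM rule of thumb above can be sketched as a toy selector. This is a hypothetical helper for illustration only — neither Spark nor Tez exposes such a function, and the 10x threshold is an assumption, not a published cutoff:

```python
def pick_engine(data_gb, cluster_ram_gb):
    """Toy heuristic from the slide: prefer a stream-oriented engine
    (Tez-style) when data vastly exceeds cluster RAM, and an in-memory
    caching engine (Spark-style) when data fits comfortably in RAM."""
    if data_gb >= 10 * cluster_ram_gb:   # Data >> RAM: streaming shuffle wins
        return "tez"
    if data_gb <= cluster_ram_gb:        # Data << RAM: in-memory caching wins
        return "spark"
    return "either"                      # gray zone: benchmark both


print(pick_engine(5000, 100))  # → tez
print(pick_engine(50, 100))    # → spark
```

In practice the decision also depends on job shape (iterative vs. one-pass), not just data volume, which is why the middle band returns "either".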

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
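Spark's file-system agnosticism boils down to the idea that the storage backend is chosen by the URI scheme of the input path. The toy dispatcher below is an illustrative sketch of that idea, not Spark's actual resolver (which delegates to Hadoop's FileSystem API); the scheme-to-backend mapping is an assumption for illustration:

```python
from urllib.parse import urlparse

# Illustrative mapping only — a real deployment's available schemes depend
# on which connector JARs are on the classpath.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "file": "local file system",
    "tachyon": "Tachyon in-memory file system",
    "swift": "OpenStack Swift object store",
}

def storage_backend(path):
    """Resolve a storage backend from a path's URI scheme."""
    scheme = urlparse(path).scheme or "file"  # no scheme -> local file
    return BACKENDS.get(scheme, "unknown")


print(storage_backend("s3n://bucket/logs"))  # → Amazon S3
print(storage_backend("/tmp/data.txt"))      # → local file system
```

The same application code can therefore move from HDFS to S3, Tachyon, or Swift by changing only the input path.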

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
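The deployment choices above are selected purely by the master URL handed to Spark (e.g. via spark-submit), so application code does not change between them. The classifier below is a hypothetical sketch of that convention using the Spark 1.x master-URL formats, not code from Spark itself:

```python
def deployment_mode(master_url):
    """Classify a Spark 1.x master URL into its cluster-manager family."""
    if master_url.startswith("local"):        # e.g. local, local[4]: in-process threads
        return "local"
    if master_url.startswith("spark://"):     # Spark standalone cluster manager
        return "standalone"
    if master_url.startswith("mesos://"):     # Apache Mesos (optionally via ZooKeeper)
        return "mesos"
    if master_url in ("yarn-client", "yarn-cluster"):  # Hadoop YARN (Spark 1.x syntax)
        return "yarn"
    raise ValueError("unrecognized master URL: " + master_url)


print(deployment_mode("local[4]"))           # → local
print(deployment_mode("spark://host:7077"))  # → standalone
```

Moving the same job from a laptop to a Mesos or YARN cluster is then a one-flag change on the command line.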

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop ecosystem   Spark ecosystem

Components: HDFS               Tachyon
            YARN               Mesos

Tools:      Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                     Mesos
Resource sharing  Yes                      Yes
Written in        Java                     C++
Scheduling        Memory only              CPU and Memory
Running tasks     Unix processes           Linux Container groups
Requests          Specific requests and    More generic, but more coding
                  locality preference      for writing frameworks
Maturity          Less mature              Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
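The idea behind Spark SQL's handling of semi-structured sources — inferring a schema from the JSON records themselves — can be illustrated with a toy pure-Python version. This is not Spark SQL's implementation (which infers at scale and performs proper type widening); `infer_schema` and the sample records are hypothetical, for illustration only:

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union the fields seen across JSON records
    and note each field's type, mimicking what Spark SQL does when it
    loads a JSON dataset as a SchemaRDD without any DDL."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema


records = ['{"name": "slim", "talks": 12}', '{"name": "ana", "city": "LA"}']
# Fields from all records are merged into one schema.
print(infer_schema(records))  # → {'name': 'str', 'talks': 'int', 'city': 'str'}
```

The point is that no schema declaration is needed up front: the dataset itself yields one that SQL queries can then run against.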

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                  Spark Streaming
Processing model              Record at a time       Mini batches
Latency                       Sub-second             Few seconds
Fault tolerance – every       At least once (may     Exactly once
record processed              be duplicates)
Batch framework integration   Not available          Core Spark API
Supported languages           Any programming        Scala, Java,
                              language               Python
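The "processing model" row of the table can be made concrete with a toy sketch. This is hypothetical illustration code, not Storm or Spark Streaming source: a record-at-a-time engine hands each event to the handler as it arrives, while a mini-batch engine first chops the stream into small batches and processes each batch as one job:

```python
def record_at_a_time(events, handle):
    """Storm-style: each record is processed as soon as it arrives."""
    return [handle(e) for e in events]

def mini_batches(events, batch_size):
    """Spark Streaming-style: the stream is chopped into small batches
    (here by count; Spark Streaming uses a time interval), and each
    batch is then processed as one small Spark job."""
    return [events[i:i + batch_size]
            for i in range(0, len(events), batch_size)]


print(mini_batches([1, 2, 3, 4, 5], 2))  # → [[1, 2], [3, 4], [5]]
```

Batching is what gives Spark Streaming its few-seconds latency floor, and also what lets each batch reuse the core Spark API and its exactly-once semantics.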

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 37: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

Sqoop on Spark

(Expected in Sqoop 2)

bull Sqoop ( aka from SQL to Hadoop) was initially

developed as a tool to transfer data from RDBMS to

Hadoop

bull The next version of Sqoop referred to as Sqoop2

supports data transfer across any two data sources

bull Sqoop 2 Proposal is still under

discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Pro

posal

bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira

Status Work In Progress) The goal of this ticket is to support a

pluggable way to select the execution engine on which we can run

the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532

37

(Expected in 31 release)

bull Cascading httpwwwcascadingorg is an application

development platform for building data applications on

Hadoop

bull Support for Apache Spark is on the roadmap and will be

available in Cascading 31 release

Source httpwwwcascadingorgnew-fabric-support

bull Spark-scalding is a library that aims to make the

transition from CascadingScalding to Spark a little

easier by adding support for Cascading Taps Scalding

Sources and the Scalding Fields API in Spark Sourcehttpscaldingio201410running-scalding-on-apache-spark

38

Apache Crunch

bull The Apache Crunch Java library provides a

framework for writing testing and running

MapReduce pipelines httpscrunchapacheorg

bull Apache Crunch 011 releases with a

SparkPipeline class making it easy to migrate

data processing applications from MapReduce

to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSpark

Pipelinehtml

bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-

xtopicscdh_ig_running_crunch_with_sparkhtml

39

(Expec (Expected in Mahout 10 )

bull Mahout News 25 April 2014 - Goodbye MapReduce

Apache Mahout the original Machine Learning (ML)

library for Hadoop since 2009 is rejecting new

MapReduce algorithm

implementationshttpmahoutapacheorg

bull Integration of Mahout and Spark

bull Reboot with new Mahout Scala DSL for Distributed

Machine Learning on Spark Programs written in this

DSL are automatically optimized and executed in

parallel on Apache Spark

bull Mahout Interactive Shell Interactive REPL shell for

Spark optimized Mahout DSLhttpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml

40

(Expected in Mahout 10 )

bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml

bull Mahout scala and spark bindings Dmitriy Lyubimov

April 2014

httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings

bull Co-occurrence Based Recommendations with

Mahout Scala and Spark Published on May 30 2014

httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-

with-mahout-scala-and-spark

bull Mahout 10 Features by Engine (unreleased)-

MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 IntegrationService Open Source Tool

StorageServi

ng Layer

Data Formats

Data

Ingestion

Services

Resource

Management

Search

SQL

43

3 Integration

bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

bull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via

newAPIHadoopRDD Example HBaseTestscala from

Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach

esparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available

for reading from and writing to HBase without the need

of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with

Spark Status Still in experimentation and no timetable for

possible support httpblogclouderacomblog201412new-in-cloudera-

labs-sparkonhbase

45

3 Integration

bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark

RDDs to Cassandra tables and execute arbitrary CQL

queries in your Spark applications Supports also

integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration

is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration

bull Benchmark of Spark amp Cassandra Integration

using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume

data from Cassandra to spark and store Resilient

Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new

avenues

bull Kindling An Introduction to Spark with Cassandra

(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-

spark-with-cassandra

47

3 Integration

bull MongoDB is not directly served by Spark although

it can be used from Spark via an official Mongo-

Hadoop connector

bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-

insights

bull Spark SQL also provides indirect support via its

support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways

1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
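In all of these deployments, the cluster manager is selected simply by the master URL passed when the Spark context is created (for example local[4], spark://host:7077, mesos://host:5050, or yarn). A minimal plain-Python sketch of that dispatch idea follows; the function name and return values are illustrative, not Spark's actual implementation:

```python
def cluster_manager(master_url: str) -> str:
    """Illustrative dispatch: map a Spark-style master URL to the
    cluster manager that would run the job."""
    if master_url.startswith("local"):        # local, local[4], local[*]
        return "local"
    if master_url.startswith("spark://"):     # standalone master
        return "standalone"
    if master_url.startswith("mesos://"):     # Apache Mesos
        return "mesos"
    if master_url in ("yarn", "yarn-client", "yarn-cluster"):
        return "yarn"
    raise ValueError(f"unknown master URL: {master_url}")

if __name__ == "__main__":
    print(cluster_manager("local[4]"))           # local
    print(cluster_manager("spark://host:7077"))  # standalone
```

The point of the sketch: the application code stays the same across deployments; only the master URL changes.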

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

             Hadoop Ecosystem    Spark Ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, making the Java code nearly as concise as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
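The conciseness of the native API comes from chaining functional transformations. As a rough plain-Python illustration of the same flatMap/map/reduceByKey style (this is not actual Spark API code, just the shape of the pipeline):

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    """RDD-style word count sketched with plain iterators:
    flatMap(split) -> map(lowercase) -> reduceByKey(add)."""
    words = chain.from_iterable(line.split() for line in lines)  # flatMap
    normalized = (w.lower() for w in words)                      # map
    return Counter(normalized)                                   # reduceByKey(+)

counts = word_count(["Spark or Hadoop", "spark with hadoop"])
print(counts["spark"])  # 2
```

In Spark itself, each step would be a lazy transformation on an RDD, distributed across the cluster rather than run on one machine.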

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

95
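The "mini batches" row is the key difference: Spark Streaming groups incoming records into small time-sliced batches and hands each batch to the core Spark engine. A plain-Python sketch of that batching idea (the function, timestamps and interval are illustrative, not Spark internals):

```python
from collections import defaultdict

def micro_batches(records, interval):
    """Group (timestamp, value) records into consecutive time-sliced
    mini batches, the way a micro-batch engine like Spark Streaming
    slices a stream (simplified: input is already time-ordered)."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[ts // interval].append(value)   # batch index = time slice
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

A record-at-a-time system like Storm would instead process "a", "b", "c", "d" individually as they arrive, trading the batch-level exactly-once semantics for lower latency.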

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


(Expected in 3.1 release)

• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.

• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support

• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark

38

Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org

• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html

• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org

• Integration of Mahout and Spark:

• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html

• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings

• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark

• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 Integration

Services and their open source tools integrating with Spark, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.

43

3 Integration

bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some open issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORC (Optimized Row Columnar) file format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
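Conceptually, schema inference walks the JSON records and unions the fields and types it sees. A rough plain-Python sketch of that idea (simplified to top-level fields, with a hypothetical function name; it is not Spark SQL's actual implementation, which also widens types and handles nesting):

```python
import json

def infer_schema(json_lines):
    """Union the top-level fields and their Python type names across
    all records, the way Spark SQL infers a schema from a JSON
    dataset (simplified: no nesting, no type-widening rules)."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(types) for f, types in schema.items()}

lines = ['{"name": "spark", "stars": 4}', '{"name": "hadoop", "tags": ["yarn"]}']
print(infer_schema(lines))
# {'name': ['str'], 'stars': ['int'], 'tags': ['list']}
```

This is why no DDL is needed: the schema is a by-product of scanning the data itself.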

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
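The point of a columnar format is that rows are pivoted into per-column arrays, so a query that touches one column reads only that column's data. A minimal plain-Python sketch of the row-to-columnar pivot (illustrative only; real Parquet adds encoding, compression, and file metadata):

```python
def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists --
    the basic layout idea behind columnar formats like Parquet.
    Assumes every row has the same fields."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"engine": "spark", "year": 2009}, {"engine": "hadoop", "year": 2006}]
cols = to_columnar(rows)
print(cols["year"])  # [2009, 2006] -- a scan of 'year' never touches 'engine'
```

Per-column storage also compresses much better, since values of one type and similar range sit next to each other.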

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice

• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
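That rule of thumb can be stated as a one-line decision. Sketched here in plain Python purely as an illustration of the slide's heuristic (the function, units and hard cutoff are assumptions, not from any engine's source or benchmark):

```python
def suggest_engine(data_gb: float, cluster_ram_gb: float) -> str:
    """Illustrative heuristic from the slide: favor Spark's in-memory
    caching when the working set fits in cluster RAM; otherwise favor
    a more stream-oriented engine such as Tez."""
    return "spark" if data_gb < cluster_ram_gb else "tez"

print(suggest_engine(data_gb=50, cluster_ram_gb=512))      # spark
print(suggest_engine(data_gb=10_000, cluster_ram_gb=512))  # tez
```

In practice the choice also depends on shuffle behavior, latency requirements, and cluster sharing, so treat this only as a starting point for your own benchmarking.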

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

                  Hadoop Ecosystem    Spark Ecosystem
Components:       HDFS                Tachyon
                  YARN                Mesos
Tools:            Pig                 Spark native API
                  Hive                Spark SQL
                  Mahout              MLlib
                  Storm               Spark Streaming
                  Giraph              GraphX
                  HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
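The RDD chain that the native API expresses so concisely can be sketched in plain Python. This is not Spark code - the helper names below are illustrative stand-ins for Spark's `flatMap`, `map`, and `reduceByKey` transformations - but it shows the shape of the classic word-count pipeline that the Scala, Java 8, and Python APIs each write in a few lines.

```python
# Pure-Python sketch (not the Spark API itself) of the RDD-style word count,
# illustrating the flatMap -> map -> reduceByKey chain.
from collections import Counter
from itertools import chain

def flat_map(f, data):
    # Like rdd.flatMap: apply f to each element and flatten the results.
    return list(chain.from_iterable(f(x) for x in data))

def reduce_by_key(pairs):
    # Like rdd.reduceByKey(_ + _): sum the values per key.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["spark or hadoop", "spark with hadoop"]
words = flat_map(str.split, lines)      # like .flatMap(_.split(" "))
pairs = [(w, 1) for w in words]         # like .map(word => (word, 1))
counts = reduce_by_key(pairs)           # like .reduceByKey(_ + _)
print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'with': 1}
```

In real Spark each step runs distributed over partitions; the local version only conveys the dataflow.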

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                      Storm                     Spark Streaming
Processing model              Record at a time          Mini-batches
Latency                       Sub-second                Few seconds
Fault tolerance (every        At least once (may        Exactly once
record processed)             be duplicates)
Batch framework integration   Not available             Core Spark API
Supported languages           Any programming           Scala, Java,
                              language                  Python

95
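The first row of the table - record-at-a-time versus mini-batches - can be made concrete with a small sketch. This is plain Python, not Storm or Spark Streaming code, and the function names are illustrative; in the real Spark Streaming system the batch boundary is a time interval (the batch duration), not a record count.

```python
# Illustrative contrast of the two stream-processing models.
def process_per_record(stream, handle):
    # Storm-style: each record is handed off as soon as it arrives.
    for record in stream:
        handle(record)

def process_in_mini_batches(stream, handle_batch, batch_size=3):
    # Spark-Streaming-style: records are grouped into small batches,
    # each processed with the ordinary (core Spark) batch API.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:
        handle_batch(batch)   # flush the final partial batch

events = list(range(7))

handled = []
process_per_record(events, handled.append)      # one record at a time

seen_batches = []
process_in_mini_batches(events, seen_batches.append)
print(seen_batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what gives Spark Streaming its "few seconds" latency floor but also its exactly-once semantics and reuse of the batch API.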

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


Apache Crunch

• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html

39

(Expected in Mahout 1.0)

• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3. Integration

[Diagram: services and corresponding open source tools in the Hadoop ecosystem that integrate with Spark, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving, and some open issues are critical ones; see the open SPARK issues mentioning YARN in the Apache JIRA: https://issues.apache.org/jira/
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
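The "import relational data, then run SQL queries over it" workflow above can be sketched without a Hadoop cluster. The snippet below uses Python's bundled sqlite3 purely as a stand-in SQL engine - it is not Spark SQL or Hive, just an illustration of the query-over-imported-data pattern that Spark SQL offers for Hive tables.

```python
# Stdlib-only sketch of loading records and querying them with SQL,
# standing in for the Spark SQL / Hive workflow described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "oom")],
)

# Analytical query over the imported data.
count, = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = 'ERROR'").fetchone()
print(count)  # 2
```

In Spark SQL the same query would run distributed over a Hive table, and the result set could feed directly into MLlib.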

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
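What makes Kafka such a good fit for Spark Streaming is its core abstraction: an append-only, partitioned log that consumers read by offset, so records can be replayed after a failure. The sketch below is plain Python with illustrative names (no Kafka client involved) and models just that replay-by-offset idea.

```python
# Toy model of one Kafka partition: an append-only log read by offset.
class MiniLog:
    def __init__(self):
        self.records = []             # the append-only partition log

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the record

    def read_from(self, offset):
        # Consumers pull from a chosen offset; nothing is destroyed on read.
        return self.records[offset:]

log = MiniLog()
for event in ["click", "view", "click"]:
    log.append(event)

committed = 1                         # consumer had committed offset 1
replayed = log.read_from(committed)   # after a crash, re-read from there
print(replayed)  # ['view', 'click']
```

Because the log is durable and addressable by offset, a restarted streaming job can resume exactly where it left off, which underpins the recovery behavior discussed in the Netflix Chaos Monkey post cited earlier.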

3. Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
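The "no more DDL" point can be illustrated with a stdlib-only sketch of schema inference: scan the JSON records and derive a field-to-type mapping instead of declaring the schema up front. Spark SQL's real inference also handles nesting, nulls, and type widening across records; this only conveys the gist, and the function name is illustrative.

```python
# Hedged sketch of the idea behind Spark SQL's JSON schema inference.
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # First type seen wins; real Spark widens conflicting types.
            schema.setdefault(field, type(value).__name__)
    return schema

records = [
    '{"name": "spark", "stars": 12000}',
    '{"name": "hadoop", "stars": 9000, "retired": false}',
]
print(infer_schema(records))
# {'name': 'str', 'stars': 'int', 'retired': 'bool'}
```

Note how the `retired` field, present in only one record, still lands in the schema - the same union-of-fields behavior Spark SQL applies to heterogeneous JSON datasets.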

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57
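What "columnar" buys you can be shown in a few lines. The sketch below (plain Python, not the Parquet format itself) pivots row-oriented records into one array per column: an analytical query touching a single column then reads one contiguous array instead of scanning every record, and each homogeneous column compresses far better than mixed rows.

```python
# Row-wise records versus a column-wise (Parquet-style) layout.
rows = [
    {"name": "spark", "year": 2010},
    {"name": "hadoop", "year": 2006},
]

def to_columnar(rows):
    # Pivot a list of records into one array per column.
    return {field: [row[field] for row in rows] for field in rows[0]}

columns = to_columnar(rows)
print(columns)
# {'name': ['spark', 'hadoop'], 'year': [2010, 2006]}

# A query that only needs 'year' now touches a single array:
print(max(columns["year"]))  # 2010
```

Real Parquet adds nested-data encoding, per-column compression, and statistics for predicate pushdown on top of this basic layout.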

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format

58
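Why a schema'd binary format like Avro is "compact" can be demonstrated with the stdlib alone. The comparison below uses `struct` as a stand-in for schema-based encoding (Avro's actual wire format uses variable-length encodings, not fixed-width `struct` fields): once writer and reader agree on a schema, only the values are serialized, whereas self-describing JSON repeats the field names in every record.

```python
# Schema-based binary encoding versus self-describing JSON.
import json
import struct

record = {"user_id": 42, "score": 3.5}

# JSON: field names travel with every record.
json_bytes = json.dumps(record).encode()

# Schema-based: both sides agree the layout is (int32, float64),
# so only the raw values are written -- 12 bytes total.
packed = struct.pack("<id", record["user_id"], record["score"])

print(len(json_bytes) > len(packed))  # True
user_id, score = struct.unpack("<id", packed)
print(user_id, score)  # 42 3.5
```

The same schema also documents the layout, which is why Avro copes well with the "layout can change without notice" problem via schema evolution.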

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4. Complementarity: HDFS + Tachyon + Spark

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: Mesos + YARN

References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps. Provide new abstractions and services.
  • Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and Memory
Running tasks     Unix processes               Linux Container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
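The concision the lambda-style API buys can be seen in the classic word count. Below is a plain-Python sketch of the flatMap / map / reduceByKey pattern that the Spark native API expresses; toy input, standard library only, no Spark installation assumed:

```python
# Sketch of the flatMap / map / reduceByKey word-count pattern,
# using only the Python standard library (no Spark required).
from itertools import groupby

lines = ["spark or hadoop", "spark and hadoop"]  # stand-in for an RDD of lines

words = [w for line in lines for w in line.split()]        # flatMap
pairs = [(w, 1) for w in words]                            # map to (word, 1)
grouped = groupby(sorted(pairs), key=lambda kv: kv[0])     # group by key
counts = {k: sum(v for _, v in g) for k, g in grouped}     # reduceByKey

print(counts)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

In Spark itself the same chain reads almost identically, with the lambdas passed to `flatMap`, `map`, and `reduceByKey` on an RDD.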

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                  Storm                  Spark Streaming
Processing Model          Record at a time       Mini batches
Latency                   Sub-second             Few seconds
Fault tolerance – every   At least once (may     Exactly once
record processed          be duplicates)
Batch Framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java, Python
                          language

95
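The first row of the comparison, record-at-a-time versus mini batches, is the key design difference. Here is a minimal sketch of the discretization Spark Streaming applies to an incoming stream, with synthetic timestamps and no streaming framework assumed:

```python
# Group a stream of (timestamp, value) records into fixed-width mini
# batches, mimicking how Spark Streaming discretizes incoming data.
from collections import defaultdict

BATCH_SECONDS = 2
events = [(0.1, "a"), (0.9, "b"), (2.5, "c"), (3.1, "d"), (4.2, "e")]

batches = defaultdict(list)
for ts, value in events:
    batches[int(ts // BATCH_SECONDS)].append(value)  # batch index per record

for batch_id in sorted(batches):
    print(batch_id, batches[batch_id])
# 0 ['a', 'b']
# 1 ['c', 'd']
# 2 ['e']
```

Each batch is then processed as one small Spark job, which is why latency is "few seconds" rather than sub-second, but exactly-once semantics come for free from the batch model.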

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. Has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark on Non-Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is File System Agnostic. Bring Your Own Storage!
2. Deployment: Spark is Cluster Infrastructure Agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

(Expected in Mahout 1.0)

• Mahout News: 25 April 2014 - Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

40

(Expected in Mahout 1.0)

• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

41

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

42

3. Integration

Service                   Open Source Tool
Storage/Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL

(Tool logos shown per service on the original slide.)

43

3. Integration

• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851

44

3. Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase

45

3. Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3. Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3. Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (implicit reference to Mesos as the Resource Negotiator).
• Integration still improving (open SPARK JIRA issues mentioning YARN): https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52
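The round trip listed above (import relational data, run SQL over it, read the results back) can be sketched with the standard library's sqlite3 standing in for Hive; the table name and rows here are hypothetical, and Spark SQL's HiveContext is the real counterpart:

```python
# Import rows, run a SQL aggregation, and fetch the result: the same
# round trip Spark SQL offers over Hive tables, here against SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('ann', 5), ('bob', 5)]
```

In Spark SQL the query string stays the same; only the context object and the table registration differ.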

3. Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3. Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3. Integration

• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
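The automatic schema inference Spark SQL performs on JSON can be sketched in a few lines of plain Python: scan records and collect each field's observed types. Spark SQL does this at scale with a much richer type system; this is only an illustration with made-up records:

```python
# Infer a crude schema (field -> set of observed type names) from JSON
# records, mimicking what Spark SQL's JSON loading automates.
import json
from collections import defaultdict

records = ['{"name": "ann", "age": 34}', '{"name": "bob", "city": "LA"}']

schema = defaultdict(set)
for line in records:
    for field, value in json.loads(line).items():
        schema[field].add(type(value).__name__)

print(dict(schema))  # {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```

Fields missing from some records (like `age` and `city` here) simply become nullable columns in the inferred schema.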

3. Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
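Why a columnar layout such as Parquet's pays off can be illustrated without the format itself: storing a table column-wise lets a query read only the columns it needs. Toy data below; real Parquet adds per-column encoding and compression on top of this idea:

```python
# Row-oriented vs column-oriented storage of the same toy table.
rows = [("ann", 34, "LA"), ("bob", 29, "NY"), ("cat", 41, "SF")]

# Column-wise: one contiguous list per column, as columnar formats lay
# data out on disk.
columns = {"name": [r[0] for r in rows],
           "age":  [r[1] for r in rows],
           "city": [r[2] for r in rows]}

# An aggregation over one column touches only that column's data,
# never the full rows -- the core I/O win of columnar storage.
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # ~34.67
```

Similar values stored next to each other also compress far better, which is the second half of Parquet's advantage.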

3. Integration

• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

(Hadoop ecosystem and Spark ecosystem diagrams shown on the original slide.)

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
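The Data << RAM case is about reuse: once parsed data is cached in memory, later stages skip the parse work entirely. A toy sketch of that caching effect, with a call counter standing in for an expensive parse step (no Spark assumed; `rdd.cache()` is the real-world counterpart):

```python
# Cache parsed records in memory so repeated queries avoid re-parsing,
# the effect rdd.cache() has when the working set fits in cluster RAM.
parse_calls = 0

def parse(raw):
    """Stand-in for expensive parsing; counts how often it runs."""
    global parse_calls
    parse_calls += 1
    return raw.upper()

raw_data = ["a", "b", "c"]
cached = [parse(r) for r in raw_data]   # parse once, keep results in memory

query1 = [x for x in cached if x < "C"]  # reuses the cache
query2 = [x for x in cached if x > "A"]  # reuses the cache again
print(parse_calls)  # 3 -- not 6: the second query never re-parsed
```

When the data no longer fits in memory, the cache spills or is recomputed, which is exactly when the Data >> RAM considerations above start to dominate.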

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a Non-Hadoop distribution:

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                 | Storm                          | Spark Streaming
-----------------------------------------|--------------------------------|---------------------
Processing model                         | Record at a time               | Mini-batches
Latency                                  | Sub-second                     | Few seconds
Fault tolerance (every record processed) | At least once (may duplicate)  | Exactly once
Batch framework integration              | Not available                  | Core Spark API
Supported languages                      | Any programming language       | Scala, Java, Python
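The mini-batch model in the comparison above can be sketched with a small Spark Streaming job: the stream is cut into fixed-interval batches, and each batch is processed with ordinary RDD operations. The host and port are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount")
    // Each mini-batch covers a 2-second window of incoming records
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()          // emit the counts computed for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because every batch is just an RDD, the same code style (and libraries such as MLlib) applies to both batch and streaming jobs, which is the "core Spark API" integration noted in the table.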

GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.

V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) - MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

3. Integration
[Diagram: mapping of services to open source tools across the stack]
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL

3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851

3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
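A minimal sketch of the newAPIHadoopRDD route, modeled on the HBaseTest.scala example referenced above. The table name is hypothetical, and the HBase client/server JARs are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseScan"))
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table
    // Expose the HBase table as an RDD of (row key, row result) pairs
    val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"rows: ${rdd.count()}")
    sc.stop()
  }
}
```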

3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
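A short sketch of the Spark Cassandra Connector usage described above. The keyspace, table and column names are hypothetical, and the `spark-cassandra-connector` artifact plus the `spark.cassandra.connection.host` setting are assumed:

```scala
import com.datastax.spark.connector._   // adds cassandraTable/saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraRoundTrip")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical node
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and read typed columns
    val words = sc.cassandraTable("my_keyspace", "words")
      .map(row => (row.getString("word"), row.getInt("count")))

    // Write an ordinary RDD back to the same table
    sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
      .saveToCassandra("my_keyspace", "words", SomeColumns("word", "count"))

    sc.stop()
  }
}
```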

3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend for Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving; open Spark/YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
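A minimal sketch of querying a Hive table from Spark SQL, assuming a Spark build with Hive support and an existing Hive metastore; the table name `src` is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
    // HiveContext reads table definitions from the Hive metastore
    val hiveCtx = new HiveContext(sc)
    val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
    rows.collect().foreach(println)
    sc.stop()
  }
}
```

The result is an ordinary RDD of rows, so it can feed straight into MLlib pipelines, as the slide notes.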

3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
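A sketch of the receiver-based Kafka integration from the guide above. The ZooKeeper address, consumer group and topic name are hypothetical, and the `spark-streaming-kafka` artifact is assumed to be on the classpath:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Receiver-based stream: (zkQuorum, consumer group, topic -> #threads)
    val stream = KafkaUtils.createStream(
      ssc, "zkhost:2181", "my-consumer-group", Map("events" -> 1))

    // Each record is a (key, message) pair; count words in message bodies
    stream.map(_._2)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```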

3. Integration
• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
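The schema-inference workflow above can be sketched in a few lines, using the Spark 1.2-era API names (`jsonFile`, `registerTempTable`); the file path and field names are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonQuery"))
    val sqlCtx = new SQLContext(sc)

    // Schema is inferred automatically from the JSON records - no DDL needed
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")
    people.printSchema()

    // Register as a table and query with plain SQL
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```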

3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
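A corresponding Parquet round trip, again using the Spark 1.2-era method names (`parquetFile`, `saveAsParquetFile`); paths are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
    val sqlCtx = new SQLContext(sc)

    // Parquet files carry their own schema, so the load needs no DDL either
    val events = sqlCtx.parquetFile("hdfs:///data/events.parquet")
    events.registerTempTable("events")
    sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println)

    // Write a (possibly filtered) result set back out in Parquet format
    events.saveAsParquetFile("hdfs:///data/events_copy.parquet")

    sc.stop()
  }
}
```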

3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.

3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
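A minimal sketch of the elasticsearch-hadoop RDD integration just described. The index/type name and node address are hypothetical, and the `elasticsearch-spark` artifact is assumed:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD to RDDs and SparkContext

object EsRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsRoundTrip")
      .set("es.nodes", "localhost")   // hypothetical Elasticsearch node

    val sc = new SparkContext(conf)

    // Any RDD whose elements translate to documents can be indexed
    val docs = Seq(
      Map("title" -> "Spark on ES", "views" -> 10),
      Map("title" -> "Hadoop and Spark", "views" -> 25))
    sc.makeRDD(docs).saveToEs("blog/post")   // index/type

    // Read the documents back as an RDD of (id, field map) pairs
    val fromEs = sc.esRDD("blog/post")
    println(s"indexed docs: ${fromEs.count()}")

    sc.stop()
  }
}
```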

3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrates ingestion of HDFS data into Solr from MapReduce to Spark
  • Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
[Diagram: Hadoop ecosystem and Spark ecosystem side by side]

4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a tight Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• A Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
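Reading from a non-HDFS store is usually just a different URI scheme. As a sketch for the S3 case (bucket name is hypothetical; credentials come from environment variables, and the 1.x-era `s3n://` scheme is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3Logs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3Logs"))

    // Pass AWS credentials through the Hadoop configuration
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same textFile API as HDFS - only the URI scheme changes
    val logs = sc.textFile("s3n://my-bucket/logs/*.gz")
    println(s"log lines: ${logs.count()}")

    sc.stop()
  }
}
```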

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
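In practice, switching between these deployment modes is mostly a matter of the `--master` URL passed to `spark-submit`. A sketch, with hypothetical host names, ports and JAR name:

```shell
# Local mode: run with 4 worker threads on one machine
./bin/spark-submit --master "local[4]" \
  --class com.example.WordCount app.jar

# Standalone cluster: point at the standalone master
./bin/spark-submit --master spark://master-host:7077 \
  --executor-memory 4G \
  --class com.example.WordCount app.jar

# Mesos cluster: point at the Mesos master instead
./bin/spark-submit --master mesos://mesos-host:5050 \
  --class com.example.WordCount app.jar
```

The application code itself does not change between modes; only the cluster manager behind the master URL does.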

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

3. Distributions
Using Spark on a non-Hadoop distribution:

Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways


Page 42: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

42

3 IntegrationService Open Source Tool

StorageServi

ng Layer

Data Formats

Data

Ingestion

Services

Resource

Management

Search

SQL

43

3 Integration

bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector

• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

45

3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/

• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/

• Cassandra as a storage backend with Spark is opening many new avenues.

• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving, and some issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/

• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, and embed Drill execution in a Spark data pipeline.

• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/

• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/

• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org/

• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop Ecosystem -> Spark Ecosystem

Components:
• HDFS -> Tachyon
• YARN -> Mesos

Tools:
• Pig -> Spark native API
• Hive -> Spark SQL
• Mahout -> MLlib
• Storm -> Spark Streaming
• Giraph -> GraphX
• HUE -> Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          | YARN                                       | Mesos
Resource sharing  | Yes                                        | Yes
Written in        | Java                                       | C++
Scheduling        | Memory only                                | CPU and Memory
Running tasks     | Unix processes                             | Linux Container groups
Requests          | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity          | Less mature                                | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python

95

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 43: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 IntegrationService Open Source Tool

StorageServi

ng Layer

Data Formats

Data

Ingestion

Services

Resource

Management

Search

SQL

43

3 Integration

bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3

bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767

bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851

44

3 Integration

bull Out of the box Spark can interface with HBase as it has

full support for Hadoop InputFormats via

newAPIHadoopRDD Example HBaseTestscala from

Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach

esparkexamplesHBaseTestscala

bull There are also Spark RDD implementations available

for reading from and writing to HBase without the need

of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector

bull SparkOnHBase is a project for HBase integration with

Spark Status Still in experimentation and no timetable for

possible support httpblogclouderacomblog201412new-in-cloudera-

labs-sparkonhbase

45

3 Integration

bull Spark Cassandra Connector This library lets you

expose Cassandra tables as Spark RDDs write Spark

RDDs to Cassandra tables and execute arbitrary CQL

queries in your Spark applications Supports also

integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector

bull Spark + Cassandra using Deep The integration

is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark

bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra

bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra

46

3 Integration

bull Benchmark of Spark amp Cassandra Integration

using different approacheshttpwwwstratiocomdeep-vs-datastax

bull Calliope is a library providing an interface to consume

data from Cassandra to spark and store Resilient

Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope

bull Cassandra storage backend with Spark is opening many new

avenues

bull Kindling An Introduction to Spark with Cassandra

(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-

spark-with-cassandra

47

3 Integration

bull MongoDB is not directly served by Spark although

it can be used from Spark via an official Mongo-

Hadoop connector

bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-

insights

bull Spark SQL also provides indirect support via its

support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3. Integration: Apache Kafka

• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
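A minimal sketch of the native integration, using the Spark 1.2-era receiver-based API (the ZooKeeper address, consumer group, and topic name below are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

// Receiver-based stream: ZooKeeper quorum, consumer group, Map(topic -> #threads)
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "demo-group", Map("events" -> 1))

// Each record is a (key, value) pair; count words in the values of every 2-second batch
messages.map(_._2).flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```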

54

3. Integration: Apache Flume

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
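Approach 1 can be sketched as follows (a hedged example; the host and port are placeholders that must match the avro sink configured in the Flume agent):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(new SparkConf().setAppName("FlumeDemo"), Seconds(5))

// Push-based: Spark Streaming listens as an Avro endpoint that the
// Flume agent's avro sink pushes events to
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 9988)

// Decode each Flume event body as a string and print a sample per batch
flumeStream.map(event => new String(event.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()
```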

55

3. Integration: JSON

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
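A short sketch of the schema-inference workflow (Spark 1.2-era API; assumes an existing SparkContext sc and a hypothetical input path):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records; no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the SchemaRDD as a table and query it with plain SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()
```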

56

3. Integration: Apache Parquet

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
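A hedged round-trip sketch in the Spark 1.2-era API (assumes an existing SparkContext sc; paths are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("hdfs:///data/people.json")

// Write the SchemaRDD out as Parquet, then read it back and query it
people.saveAsParquetFile("hdfs:///data/people.parquet")
val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people").collect()
```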

57

3. Integration: Apache Avro

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
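A sketch of the spark-avro library's early API (assumes an existing SparkContext sc; the file path and field names are hypothetical):

```scala
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)

// Load an Avro file as a SchemaRDD and query it with SQL
val episodes = sqlContext.avroFile("hdfs:///data/episodes.avro")
episodes.registerTempTable("episodes")
sqlContext.sql("SELECT title FROM episodes WHERE air_date > '1970-01-01'").collect()
```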

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration: Elasticsearch

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
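A hedged sketch of the elasticsearch-hadoop native integration (assumes a SparkContext sc whose SparkConf was created with .set("es.nodes", "localhost:9200"); the index/type name spark/docs is a placeholder):

```scala
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

// Any RDD whose elements can be translated into documents can be indexed
val docs = sc.makeRDD(Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop")))
docs.saveToEs("spark/docs")

// Read an index back as an RDD of (documentId, fieldMap) pairs
val fromEs = sc.esRDD("spark/docs")
println(fromEs.count())
```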

60

3. Integration: Apache Solr

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration: HUE

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem + Spark ecosystem

4. Complementarity: Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity: YARN + Mesos

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity: Spark + Tez

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
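For instance, reading directly from Amazon S3 instead of HDFS is a one-liner once credentials are set (a hedged sketch: the bucket name is hypothetical, and s3n:// was the common scheme in the Hadoop 2.x era; assumes an existing SparkContext sc):

```scala
// Pass AWS credentials to the underlying Hadoop S3 connector
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// textFile works on any Hadoop-supported file system, not just HDFS
val logs = sc.textFile("s3n://my-bucket/logs/*.gz")
println(logs.count())
```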

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
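In code, this agnosticism shows up as nothing more than the master URL (a hedged sketch; host names and ports are placeholders, and on YARN the master is normally set via spark-submit with --master yarn-client or yarn-cluster rather than in code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The cluster manager is chosen purely by the master URL
val conf = new SparkConf().setAppName("AgnosticApp")
  .setMaster("local[*]")           // 1. Local: all cores of one machine
// .setMaster("spark://host:7077")  // 2. Standalone cluster
// .setMaster("mesos://host:5050")  // 3. Apache Mesos

val sc = new SparkContext(conf)
```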

77

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

             Hadoop ecosystem   Spark ecosystem
             ----------------   -----------------------
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                             Mesos
----------------  -------------------------------  -------------------------------
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and Memory
Running tasks     Unix processes                   Linux Container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as with the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
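The conciseness of the native Scala API is easiest to see in the canonical word count (a sketch; assumes an existing SparkContext sc, and the input path is a placeholder):

```scala
// Word count in Spark's native Scala API: three transformations, one action
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split(" "))   // split lines into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum counts per word

counts.take(10).foreach(println)
```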

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                 Storm                      Spark Streaming
-----------------------  -------------------------  --------------------
Processing model         Record at a time           Mini batches
Latency                  Sub-second                 Few seconds
Fault tolerance (every   At least once (may be      Exactly once
record processed)        duplicates)
Batch framework          Not available              Core Spark API
integration
Supported languages      Any programming language   Scala, Java, Python

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
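The Data << RAM point above is the heart of Spark's caching advantage: parse raw input once, keep the parsed form in memory, and let every later pass skip the parsing cost. A minimal pure-Python sketch of that idea (toy code, not Spark's API; the `CachedDataset` class is invented for illustration):

```python
import json

class CachedDataset:
    """Toy stand-in for an RDD with .cache(): raw records are parsed on
    the first pass and the parsed form is kept in memory for later passes."""

    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
        self._parsed = None  # populated on first access, like a cached RDD

    def records(self):
        if self._parsed is None:  # first pass: pay the parsing cost once
            self._parsed = [json.loads(line) for line in self.raw_lines]
        return self._parsed

raw = ['{"id": %d}' % i for i in range(1000)]
ds = CachedDataset(raw)
first_total = sum(rec["id"] for rec in ds.records())   # parses and caches
second_total = sum(rec["id"] for rec in ds.records())  # reuses cached parse
```

When the working set no longer fits in memory this strategy breaks down, which is exactly the slide's Data >> RAM caveat.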

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
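In practice, Spark's storage agnosticism shows up as URI schemes on input paths (hdfs://, s3n://, tachyon://, swift://, and so on). A toy dispatcher sketching that idea; the scheme-to-backend table and function name are invented for illustration, since Spark delegates real resolution to the Hadoop FileSystem API:

```python
from urllib.parse import urlparse

# Illustrative scheme -> backend table (hypothetical, not Spark internals)
BACKENDS = {
    "hdfs": "HDFS",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def resolve_backend(path):
    """Map a path's URI scheme to a storage backend name."""
    scheme = urlparse(path).scheme or "file"  # no scheme -> local file
    return BACKENDS.get(scheme, "unknown")
```

The application code that follows the read is identical no matter which backend the scheme selects; that is what "Bring Your Own Storage" means here.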

1. File System

When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
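Each deployment mode above is selected through the master URL handed to Spark (for example local[4], spark://host:7077, mesos://host:5050, yarn-client). A toy classifier for the common 2015-era forms; the real parsing lives inside SparkContext, so treat this as an illustrative sketch only:

```python
def deployment_mode(master):
    """Classify a Spark master URL into a deployment mode (toy version)."""
    if master == "local" or master.startswith("local["):
        return "local"            # single-JVM local mode, e.g. local[4]
    if master.startswith("spark://"):
        return "standalone"       # Spark's built-in cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("yarn"):
        return "yarn"             # yarn-client / yarn-cluster
    return "unknown"
```

The point of the slide carries over directly: swapping cluster managers is a one-string change, not an application rewrite.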

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

           Hadoop ecosystem    Spark ecosystem
Component:
           HDFS                Tachyon
           YARN                Mesos
Tools:
           Pig                 Spark native API
           Hive                Spark SQL
           Mahout              MLlib
           Storm               Spark Streaming
           Giraph              GraphX
           HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria           YARN                        Mesos
Resource sharing   Yes                         Yes
Written in         Java                        C++
Scheduling         Memory only                 CPU and memory
Running tasks      Unix processes              Linux container groups
Requests           Specific requests and       More generic, but more coding
                   locality preference         for writing frameworks
Maturity           Less mature                 Relatively more mature

90
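The Scheduling row of the table (memory only vs. CPU and memory, as the slide characterizes the two managers) comes down to how many resource dimensions an admission check inspects. A toy sketch of that difference; the `fits` helper and the numbers are invented for illustration, not YARN or Mesos code:

```python
def fits(task, free, dimensions):
    """Admit a task only if every scheduled dimension has enough capacity."""
    return all(task[d] <= free[d] for d in dimensions)

free = {"mem_gb": 8, "cpus": 2}
task = {"mem_gb": 4, "cpus": 4}

memory_only = fits(task, free, ["mem_gb"])           # admits the task
cpu_and_mem = fits(task, free, ["mem_gb", "cpus"])   # rejects: needs 4 CPUs
```

Checking more dimensions rejects tasks a single-dimension scheduler would over-commit, which is why the extra generality matters for mixed workloads.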

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
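The conciseness point is easiest to see on a word count, the canonical Spark example. Below is the same functional idiom in plain Python (not Spark's API; in Spark this would be flatMap, map, and reduceByKey on an RDD):

```python
from collections import Counter

lines = ["spark and hadoop", "spark or hadoop"]

# flatMap-like step: split every line and flatten into one word list
words = [w for line in lines for w in line.split()]

# reduceByKey-like step: count occurrences per word
counts = Counter(words)
```

The whole pipeline is a couple of expressions; that is the style Java 8 lambdas bring to the Java API.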

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
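That mix-and-match pattern, declarative SQL for the relational part and ordinary program logic afterwards, can be sketched with the stdlib's SQLite standing in for Spark SQL (a toy example; the table name and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 4.0), ("ann", 6.0)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: keep processing the query result in plain code
top_user = max(rows, key=lambda r: r[1])[0]
```

In Spark SQL the query result is an RDD (a DataFrame from 1.3), so the hand-off between the SQL step and the programmatic step is exactly this seamless.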

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                   Spark Streaming
Processing model              Record at a time        Mini batches
Latency                       Sub-second              Few seconds
Fault tolerance - every       At least once           Exactly once
record processed              (may be duplicates)
Batch framework integration   Not available           Core Spark API
Supported languages           Any programming         Scala, Java, Python
                              language

95
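The Processing model row is the key architectural difference: Storm hands each record off as it arrives, while Spark Streaming groups arrivals into mini batches and runs a small Spark job per batch. A toy illustration of batching, grouped by count for simplicity (real Spark Streaming batches by time interval; the helper below is invented):

```python
def micro_batches(stream, batch_size, handle):
    """Group an incoming stream into mini batches of batch_size records."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)   # one small "job" per batch
            batch = []
    if batch:
        handle(batch)       # flush the final partial batch

batches = []
micro_batches(range(7), 3, batches.append)
```

Batching amortizes per-record overhead, buying throughput at the cost of the few-seconds latency shown in the table.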

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
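In practice, Spark's file-system agnosticism surfaces in the URI scheme of the path passed to calls like `sc.textFile`: the scheme selects the storage backend. The following plain-Python sketch illustrates that idea only; it is not Spark's actual path resolver, and the scheme-to-backend mapping is illustrative (which schemes a real cluster supports depends on the Hadoop client libraries on its classpath).

```python
from urllib.parse import urlparse

# Illustrative mapping of URI schemes to storage backends.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3a": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def storage_backend(path):
    """Return which storage layer a Spark-style path URI would address."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

print(storage_backend("s3a://my-bucket/logs/2015/"))  # Amazon S3
print(storage_backend("/data/local.txt"))             # local file system
```

The point of the sketch: swapping HDFS for S3, Swift, or Tachyon is, from the application's perspective, largely a change of path prefix.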

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

              Hadoop Ecosystem   Spark Ecosystem
Components:   HDFS               Tachyon
              YARN               Mesos
Tools:        Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and Memory
Running tasks     Unix processes                   Linux Container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
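The flavor of the native API, chained transformations ending in an action, can be imitated in plain Python with a tiny RDD-like wrapper. This is an illustration of the programming model only, not PySpark itself: the `MiniRDD` class and its eager lists are made up for the sketch (real RDDs are lazy and partitioned).

```python
class MiniRDD:
    """Toy stand-in for an RDD: eager lists instead of lazy partitions."""
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduce_by_key(self, f):
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

# The classic word count, written the way a Spark program reads:
lines = MiniRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines
          .flat_map(str.split)
          .map(lambda w: (w, 1))
          .reduce_by_key(lambda a, b: a + b)
          .collect())
print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The same chain in Scala or Java 8 differs mainly in lambda syntax, which is why the deck notes Java 8 brings the Java API close to the Scala one.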

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
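The "mix and match" style described above, a declarative SQL step followed by imperative post-processing, can be sketched with the stdlib `sqlite3` module standing in for the SQL engine. This is a pattern illustration only; Spark SQL's own API (SchemaRDD/DataFrame) differs, and the table and values below are invented.

```python
import sqlite3

# Declarative part: let the SQL engine do the relational work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 120), ("bob", 300), ("ann", 80)])
rows = conn.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user").fetchall()

# Imperative part: arbitrary code over the query result.
report = {user: f"{total / 1024:.2f} KiB" for user, total in rows}
print(report)
```

In Spark SQL the two halves share one engine and one data representation, so the handoff between SQL and code is free of the serialization boundary this sketch implies.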

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                  Storm                     Spark Streaming
Processing model          Record at a time          Mini batches
Latency                   Sub-second                Few seconds
Fault tolerance (every    At least once (may        Exactly once
record processed)         be duplicates)
Batch framework           Not available             Core Spark API
integration
Supported languages       Any programming           Scala, Java,
                          language                  Python

95
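The processing-model row is the key difference: Storm hands each record to user code as it arrives, while Spark Streaming buffers records into small time-based batches. A plain-Python sketch of the micro-batching idea follows; the timestamps and the batch interval are invented for illustration.

```python
def micro_batches(records, interval):
    """Group (timestamp, value) records into batches of `interval` seconds,
    the way a micro-batch engine buffers input before processing it."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Records arriving over ~3 seconds, batched on a 1-second interval:
stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Batching is what trades Storm's sub-second latency for Spark Streaming's throughput and its exactly-once semantics: a whole batch can be recomputed deterministically if a node fails.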

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


3 Integration

• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector

• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark

• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra

• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra

46

3 Integration

• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope

• A Cassandra storage backend with Spark is opening many new avenues.

• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3 Integration

• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo

• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support via its support for reading and writing JSON text files.

48

3 Integration

• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector

• "Using MongoDB with Hadoop & Spark":

• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3 Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph database.

• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50

3 Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).

• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some of these issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3 Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables.

• Run SQL queries over imported data.

• Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.

• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org

• Drill and Spark integration is work in progress in 2015 to address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.

• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.

• Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach.

• Approach 2 (experimental): pull-based approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.

• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
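Schema inference over JSON amounts to walking the records and unioning the fields and their types. A simplified plain-Python illustration of the idea follows; it is not Spark SQL's actual algorithm, which also merges conflicting types and handles nested structures.

```python
import json

def infer_schema(json_lines):
    """Union field names and value types across newline-delimited JSON."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return {k: sorted(v) for k, v in sorted(schema.items())}

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]
print(infer_schema(records))
# {'age': ['int'], 'city': ['str'], 'name': ['str']}
```

Note how fields absent from some records ("age", "city") still end up in the unified schema; in Spark SQL those would simply be nullable columns.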

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files.

• Run SQL queries over imported data.

• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets.

• Data layout can change without notice.

• New data sets can be added without notice.

• Result:

• Leverage Spark to dynamically split the data.

• Leverage Avro to store the data in a compact binary format.

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.

• Update and delete existing documents in Solr at scale.

• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• "Big Data Web applications for Interactive Hadoop", by Enrico Berti at Big Data Spain 2014: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References:

• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• "Matt Schumpert on Datameer Smart Execution Engine": http://www.infoq.com/articles/datameer-smart-execution-engine (an interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).

• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark
and MapReduce programs can run on top of it
without any code change.

• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-
grained sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share a datacenter between multiple cluster computing
apps. Provide new abstractions and services.

• Mesosphere DCOS: datacenter services, including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

  Criteria           YARN                    Mesos

  Resource sharing   Yes                     Yes
  Written in         Java                    C++
  Scheduling         Memory only             CPU and Memory
  Running tasks      Unix processes          Linux Container groups
  Requests           Specific requests       More generic, but more
                     and locality            coding for writing
                     preference              frameworks
  Maturity           Less mature             Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much
more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
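The functional chain a Spark word count expresses (flatMap, then map, then reduceByKey) can be sketched in plain Python. This is a conceptual stand-in, not the Spark API: the helper functions below are hypothetical, while the comments name the real RDD operations they mimic.

```python
def flat_map(f, xs):
    # flatMap: apply f to each element and concatenate the results
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # reduceByKey: merge all values that share a key with f
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

lines = ["spark or hadoop", "spark and hadoop"]

words = flat_map(str.split, lines)                 # rdd.flatMap(_.split(" "))
pairs = [(w, 1) for w in words]                    # .map(word => (word, 1))
counts = reduce_by_key(lambda a, b: a + b, pairs)  # .reduceByKey(_ + _)

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In pyspark the same pipeline reads almost identically, which is the point of the native API: the program is the dataflow.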

Spark SQL

• Spark SQL is a new SQL engine designed from the
ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDF) and the Hive
metastore.

• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide schema, such as JSON, Parquet, Hive or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

  Criteria             Storm                  Spark Streaming

  Processing model     Record at a time       Mini batches
  Latency              Sub-second             Few seconds
  Fault tolerance –    At least once (may     Exactly once
  every record         be duplicates)
  processed
  Batch framework      Not available          Core Spark API
  integration
  Supported            Any programming        Scala, Java,
  languages            language               Python

95
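The "record at a time" vs. "mini batches" row above is the key design difference. A micro-batch engine groups arriving records into fixed time windows and runs each window as a small batch job. A minimal sketch in plain Python (the timestamps and interval are illustrative, not a real Spark Streaming API):

```python
def micro_batches(records, interval):
    """Group (timestamp, value) records into fixed-width time windows."""
    batches = {}
    for ts, value in records:
        window = int(ts // interval)          # which mini-batch this record lands in
        batches.setdefault(window, []).append(value)
    # Each window is then processed as one small batch job
    return [batches[w] for w in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
print(micro_batches(stream, interval=1.0))  # [['a', 'b'], ['c'], ['d', 'e']]
```

This is why Spark Streaming's latency is "a few seconds" (one batch interval) while Storm, processing each record as it arrives, can go sub-second.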

GraphX

96

'GraphX' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup or even JavaScript in a
collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for
IPython. https://github.com/tribbloid/ISpark

IV. Spark on Non-Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage!

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


3. Integration

• Benchmark of Spark & Cassandra integration
using different approaches: http://www.stratio.com/deep-vs-datastax

• Calliope is a library providing an interface to consume
data from Cassandra to Spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandra.
http://tuplejump.github.io/calliope

• Cassandra as a storage backend with Spark is opening many new
avenues.

• Kindling: An Introduction to Spark with Cassandra
(Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra

47

3. Integration

• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-
Hadoop connector. https://github.com/mongodb/mongo-hadoop

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo

• MongoDB and Hadoop: Driving Business Insights:
http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights

• Spark SQL also provides indirect support, via its
support for reading and writing JSON text files.

48

3. Integration

• There is also NSMC: Native Spark MongoDB Connector,
for reading and writing MongoDB collections directly from
Apache Spark (still experimental).

• GitHub: https://github.com/spirom/spark-mongodb-connector

• Using MongoDB with Hadoop & Spark:

• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup

• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example

• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways

• Interesting blog on using Spark with MongoDB without
Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html

49

3. Integration

• Neo4j is a highly scalable, robust (fully ACID) native graph
database.

• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015:
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html

• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015:
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html

• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014:
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

50
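The kind of graph computation those posts run (PageRank over a Neo4j graph, executed in Spark) can be shown with a minimal, pure-Python PageRank over a toy three-node graph. This is a sketch of the algorithm itself, not of the GraphX or Neo4j APIs:

```python
def pagerank(links, damping=0.85, iters=20):
    """links: node -> list of nodes it points to (no dangling nodes)."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Base rank from the random-jump term, plus rank flowing along edges
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
assert ranks["c"] > ranks["b"]  # c has the most incoming rank
```

GraphX ships this as a built-in (graph.pageRank in Scala); the value of running it in Spark is distributing exactly this iteration across a cluster.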

3. Integration: YARN

• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the resource negotiator).

• Integration is still improving:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC

• Some issues are critical ones.

• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

51

3. Integration

• Spark SQL provides built-in support for Hive tables:

• Import relational data from Hive tables.

• Run SQL queries over imported data.

• Easily write RDDs out to Hive tables.

• Hive 0.13 is supported in Spark 1.2.0.

• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883).

• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.

52

3. Integration

• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration.
http://drill.apache.org

• Drill and Spark integration is work in progress in 2015, to
address new use cases:

• Use a Drill query (or view) as the input to Spark: Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark.

• Use Drill to query Spark RDDs: use BI tools to query
in-memory data in Spark. Embed Drill execution in a
Spark data pipeline.

Source: What's Coming in 2015 for Drill:
http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

53

3. Integration

• Apache Kafka is a high-throughput distributed
messaging system. http://kafka.apache.org

• Spark Streaming integrates natively with Kafka.
Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3. Integration

• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem. http://flume.apache.org

• Spark Streaming integrates natively with
Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach.

• Approach 2 (experimental): pull-based
approach using a custom sink.

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55
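The push vs. pull distinction can be sketched with a standard-library queue: in the pull-based approach, events accumulate in a buffer (the custom sink) and the consumer drains them at its own pace. This is a simplified single-threaded stand-in, not Flume's or Spark's actual API:

```python
import queue

sink = queue.Queue()  # stands in for the custom Flume sink that buffers events

# Producer side: Flume pushes events into the sink as they arrive
for event in ["e1", "e2", "e3"]:
    sink.put(event)

# Consumer side (Spark Streaming in the pull model): drain on its own schedule
def pull_batch(q, max_items):
    batch = []
    while len(batch) < max_items and not q.empty():
        batch.append(q.get())
    return batch

first = pull_batch(sink, max_items=2)
second = pull_batch(sink, max_items=2)
print(first, second)  # ['e1', 'e2'] ['e3']
```

Buffering in a sink is what gives the pull approach its stronger reliability: the receiver can re-pull after a failure instead of losing pushed records.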

3. Integration

• Spark SQL provides built-in support for JSON that
vastly simplifies the end-to-end experience of
working with JSON data.

• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL: just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
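Conceptually, inferring a schema from JSON records looks like the following plain-Python sketch: scan the records and union the field-to-type mapping. Spark SQL's real inference (jsonFile/jsonRDD) is richer, handling nested structures and type widening across the whole dataset, but the idea is the same:

```python
import json

def infer_schema(json_lines):
    """Union the field -> type names seen across a set of JSON records."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]
print(infer_schema(records))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Fields missing from some records (like "age" and "city" above) simply become nullable columns in the inferred schema.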

3. Integration

• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model
or programming language. http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files.

• Run SQL queries over imported data.

• Easily write RDDs out to Parquet files.
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL:
http://www.infoobjects.com/spark-sql-parquet

57
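The columnar idea behind Parquet fits in a few lines: store values by column instead of by row, so a query touches only the columns it needs. This is a layout sketch only; Parquet adds encodings, compression and row groups on top:

```python
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
]

# Row layout -> column layout (what a columnar format stores contiguously)
columns = {key: [r[key] for r in rows] for key in rows[0]}

# A scan over one column reads only that column's values,
# never touching "user" or "country"
print(sum(columns["clicks"]))  # 10
```

Same-typed values stored together also compress far better, which is the other half of Parquet's appeal.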

3. Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc. http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3. Integration

• Elasticsearch is a real-time distributed search and analytics
engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data, integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3. Integration

• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion and serving of
searchable complex data: "CrunchIndexerTool on
Spark".

• A Solr-on-Spark solution using Apache Solr, Spark,
Crunch and Morphlines:

• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark.

• Update and delete existing documents in Solr at scale.

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark, called Spark
Igniter, lets users execute and monitor Spark jobs
directly from their browser and be more
productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III. Spark with Hadoop

1. Evolution

2. Transition

3. Integration

4. Complementarity

5. Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN
cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get
Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the
need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: For processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented", has a more mature
shuffling implementation, and closer YARN integration.

• Data << RAM: Since Spark can cache parsed data
in memory, it can be much better when we process
data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer.

• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015:
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at
Spark Summit 2014, July 2, 2014:
https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra,
presents itself as a non-Hadoop Big Data platform.
Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014:
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published November 24, 2014:
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 48: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull MongoDB is not directly served by Spark although

it can be used from Spark via an official Mongo-

Hadoop connector

bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo

bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-

insights

bull Spark SQL also provides indirect support via its

support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop

48

3 Integration

bull There is also NSMC Native Spark MongoDB Connector

for reading and writing MongoDB collections directly from

Apache Spark (still experimental)

bull GitHub httpsgithubcomspiromspark-mongodb-connector

bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/

53

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54

3 Integration

• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): Pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented" and has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data (most commonly) from Amazon's S3, Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem component → Spark ecosystem alternative:

• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria            YARN                        Mesos
Resource sharing    Yes                         Yes
Written in          Java                        C++
Scheduling          Memory only                 CPU and Memory
Running tasks       Unix processes              Linux Container groups
Requests            Specific requests and       More generic, but more coding
                    locality preference         for writing frameworks
Maturity            Less mature                 Relatively more mature

90

Spark Native API

• The Spark native API is available in Scala, Java and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-) structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                  Storm                   Spark Streaming
Processing model          Record at a time        Mini batches
Latency                   Sub-second              Few seconds
Fault tolerance –         At least once           Exactly once
every record processed    (may be duplicates)
Batch framework           Not available           Core Spark API
integration
Supported languages       Any programming         Scala, Java, Python
                          language

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-

introduction-setup PART 1

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-

example Part 2

bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-

example-key-takeaways PART 3

bull Interesting blog on Using Spark with MongoDB without

Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml

49

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or… HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68
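The DAG-building idea above can be shown with a toy scheduler. The following is a hedged, pure-Python sketch (not Tez or Spark code); the stage names and the `topological_order` helper are assumptions for the example:

```python
from collections import deque

def topological_order(dag):
    """Return a stage execution order for a DAG given as {stage: [dependencies]}."""
    # Count unmet dependencies for each stage.
    pending = {stage: len(deps) for stage, deps in dag.items()}
    # Reverse edges: which stages become runnable once `stage` finishes.
    dependents = {stage: [] for stage in dag}
    for stage, deps in dag.items():
        for dep in deps:
            dependents[dep].append(stage)
    ready = deque(s for s, n in pending.items() if n == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in dependents[stage]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(dag):
        raise ValueError("cycle detected")
    return order

# A toy ETL pipeline: two extracts feed a join, which feeds an aggregate.
etl_dag = {
    "extract_users": [],
    "extract_events": [],
    "join": ["extract_users", "extract_events"],
    "aggregate": ["join"],
}
print(topological_order(etl_dag))
```

An engine that knows the whole DAG up front, as Tez and Spark do, can order and optimize stages before running any of them.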

4 Complementarity +

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
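The "Data << RAM" advantage can be shown with a toy in plain Python. This is a hedged sketch, not Spark code: a call counter stands in for an expensive parse step, and a plain list stands in for `rdd.cache()`:

```python
parse_calls = 0

def parse(record):
    """Pretend-expensive parse step (e.g. CSV/JSON decoding)."""
    global parse_calls
    parse_calls += 1
    return record.split(",")

raw = ["1,alice", "2,bob", "3,carol"]

# Without caching: every job re-parses the raw data.
job1 = [row[1] for row in map(parse, raw)]
job2 = [int(row[0]) for row in map(parse, raw)]
assert parse_calls == 6  # 3 records x 2 jobs

# With caching (the rdd.cache() analogue): parse once, reuse in memory.
parse_calls = 0
cached = [parse(r) for r in raw]   # materialized once
job1 = [row[1] for row in cached]
job2 = [int(row[0]) for row in cached]
assert parse_calls == 3  # each record parsed a single time
```

When the parsed data fits in cluster memory, every job after the first pays no parsing cost, which is exactly where Spark shines.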

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
• OpenStack Swift (Object Store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
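Spark picks the storage connector from the URI scheme of the path it is given. As a hedged illustration (plain Python, not Spark internals), a toy dispatcher can route paths the same way; the `BACKENDS` mapping and `resolve_backend` helper are assumptions for the example:

```python
from urllib.parse import urlparse

# Toy mapping from URI scheme to storage backend, mimicking how a
# file-system-agnostic engine picks a connector. Illustrative only.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def resolve_backend(path):
    """Classify a data path by its URI scheme; bare paths count as local."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown backend")

print(resolve_backend("hdfs://namenode:8020/logs/2015/03"))
print(resolve_backend("s3n://my-bucket/logs/part-00000"))
print(resolve_backend("tachyon://master:19998/cached/data"))
print(resolve_backend("/tmp/local.txt"))
```

The point of the sketch: swapping storage under Spark is a matter of changing the path prefix, not the application logic.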

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
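The deployment mode is ultimately just a "master" URL handed to Spark (for example via `spark-submit --master`). The URL forms below (`local[*]`, `spark://`, `mesos://`, `yarn`) are Spark's documented master formats; the `classify_master` helper itself is a hypothetical pure-Python sketch:

```python
import re

def classify_master(master):
    """Map a Spark-style master URL to the deployment mode it selects."""
    if master == "yarn" or master.startswith("yarn-"):
        return "YARN"
    if re.fullmatch(r"local(\[(\d+|\*)\])?", master):
        return "local"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    return "unknown"

print(classify_master("local[*]"))           # local
print(classify_master("spark://host:7077"))  # standalone
print(classify_master("mesos://host:5050"))  # Mesos
print(classify_master("yarn-cluster"))       # YARN
```

The same application jar can move between all of these clusters with no code change, only a different master URL.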

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform, Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89
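The fine-grained sharing claim above can be made concrete with a toy simulation. This is a hedged pure-Python sketch with made-up numbers, not a benchmark and not Mesos code: a statically partitioned job keeps a fixed slice of CPUs, while a fine-grained job is offered whatever is idle each tick:

```python
def run_offers(total_cpus, static_share, ticks):
    """Compare a static partition vs fine-grained offers for one job.

    `static_share` is the fixed slice a statically partitioned job gets;
    with fine-grained offers the job may use every idle CPU each tick.
    """
    static_work = static_share * ticks
    fine_grained_work = 0
    for tick in range(ticks):
        # Other frameworks' usage fluctuates; the remainder sits idle.
        others_busy = 2 if tick % 2 == 0 else 6
        idle = total_cpus - others_busy
        fine_grained_work += idle  # the job is offered all idle CPUs
    return static_work, fine_grained_work

static_work, fine_work = run_offers(total_cpus=8, static_share=2, ticks=10)
print(static_work, fine_work)  # 20 40
```

Even with these toy numbers, soaking up idle capacity roughly doubles the work completed, which is the effect Mesos' resource offers exploit for long-running Spark jobs.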

YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
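The conciseness the native API aims for can be mimicked in a few lines of plain Python. This is a hedged stand-in for an RDD pipeline, not actual Spark code, but it has the same flatMap → map → reduceByKey shape as the canonical Spark word count:

```python
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop", "spark"]

# flatMap -> map -> reduceByKey, expressed with plain Python building blocks.
words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = ((word, 1) for word in words)                         # map to (key, 1)
counts = Counter()                                            # reduceByKey(add)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'spark': 3, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Spark itself the same chain runs distributed over partitions; the lambda-style operators are what make the Scala, Python and Java 8 versions read almost identically.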

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
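The "mix SQL and imperative APIs" idea can be sketched without a cluster. The example below uses Python's stdlib sqlite3 purely as an analogy for Spark SQL (this is not Spark's API, and the table and threshold are invented for the example): a declarative query feeds ordinary imperative post-processing:

```python
import sqlite3

# In-memory table standing in for a Hive/Parquet-backed Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 3), ("bob", 7), ("alice", 2), ("carol", 5)],
)

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary host-language logic over the query result.
heavy_users = [user for user, total in rows if total >= 5]
print(heavy_users)  # ['alice', 'bob', 'carol']
```

Spark SQL's pitch is exactly this workflow at cluster scale: the SQL part stays declarative and optimizable, while the surrounding program remains ordinary Scala, Java or Python.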

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python

95
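The record-at-a-time vs mini-batch distinction in the table can be sketched in plain Python. This is illustrative only, not Storm or Spark code: the same stream is handled once per record and once grouped into fixed-size micro-batches:

```python
def per_record(stream, handle):
    """Storm-style: invoke the handler once per arriving record."""
    for record in stream:
        handle([record])

def micro_batches(stream, handle, batch_size):
    """Spark Streaming-style: buffer records and hand over small batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:  # flush the final partial batch
        handle(batch)

calls = []
micro_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Per-record handling minimizes latency but pays per-record overhead; batching amortizes that overhead (and lets the batch engine's fault tolerance apply), at the cost of latency on the order of the batch interval.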

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
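GraphX's canonical example is PageRank. Below is a hedged pure-Python power-iteration sketch (not GraphX code; the three-page graph and damping factor are chosen for the example):

```python
def pagerank(links, iterations=20, damping=0.85):
    """Power-iteration PageRank over {page: [outgoing links]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        contribs = {p: 0.0 for p in pages}
        for page, outs in links.items():
            for out in outs:
                contribs[out] += rank[page] / len(outs)
        rank = {p: (1 - damping) / len(pages) + damping * c
                for p, c in contribs.items()}
    return rank

# a links to b and c; b and c both link back to a.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # a collects the most rank
```

GraphX runs this same iterate-until-convergence pattern as a distributed Pregel-style computation, which is why iterative graph algorithms are a poor fit for plain MapReduce but a natural fit for Spark.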

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic; bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 50: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Neo4j is a highly scalable robust (fully ACID) native graph

database

bull Getting Started with Apache Spark and Neo4j Using

Docker Compose By Kenny Bastani March 10 2015

httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml

bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015

httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml

bull Using Apache Spark and Neo4j for Big Data Graph

Analytics By Kenny Bastani November 3 2014

httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml

50

3 Integration YARN

bull YARN Yet Another Resource Negotiator Implicit

reference to Mesos as the Resource Negotiator

bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND

20summary20~20yarn20AND20status203D20OPEN20ORDER20

BY20priority20DESC0A

bull Some issues are critical ones

bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml

bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU

51

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +


89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi


3. Integration: YARN

• YARN = Yet Another Resource Negotiator, an implicit reference to Mesos as the original resource negotiator.
• Integration is still improving; open YARN-related Spark issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of the open issues are critical.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

3. Integration: Hive

• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
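The Hive integration above can be sketched in a few lines of PySpark (Spark 1.x `HiveContext` API). This is a hedged illustration, not the deck's own code: the table name `logs` and column `status` are hypothetical, and actually running it requires a Spark installation; only the SQL-building helper is pure Python.

```python
def status_count_sql(table):
    # Pure helper: the query sent to Hive via Spark SQL.
    # 'status' is a hypothetical column name.
    return "SELECT status, COUNT(*) AS n FROM %s GROUP BY status" % table

def main():
    # Spark-side wiring (requires a Spark 1.x install; HiveContext picks up
    # hive-site.xml when present, otherwise it creates a local metastore).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    sc = SparkContext("local[*]", "hive-demo")
    hive = HiveContext(sc)
    for row in hive.sql(status_count_sql("logs")).collect():  # 'logs' is hypothetical
        print(row)

if __name__ == "__main__":
    main()
```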

3. Integration: Drill

• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is a work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015

3. Integration: Kafka

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
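A minimal sketch of the native Kafka integration described above, assuming Spark 1.3+ (when `KafkaUtils` gained a Python API) with the spark-streaming-kafka package, plus a running ZooKeeper/Kafka. The topic `events`, group `demo-group`, host `localhost:2181`, and the "user,action" message format are all placeholders; only the parsing helper is pure Python.

```python
def parse_event(message):
    # Pure helper: split a hypothetical "user,action" Kafka message into a pair.
    user, action = message.split(",", 1)
    return (user, action)

def main():
    # Streaming wiring: requires Spark 1.3+ with spark-streaming-kafka on the
    # classpath and ZooKeeper reachable at localhost:2181 (placeholder).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    sc = SparkContext("local[2]", "kafka-demo")  # >= 2 threads: receiver + worker
    ssc = StreamingContext(sc, 5)                # 5-second micro-batches
    stream = KafkaUtils.createStream(ssc, "localhost:2181", "demo-group", {"events": 1})
    # Kafka delivers (key, value) pairs; count identical events per batch.
    stream.map(lambda kv: parse_event(kv[1])).countByValue().pprint()
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
```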

3. Integration: Flume

• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

3. Integration: JSON

• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
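The schema-inference flow above can be sketched with the Spark 1.2-era `SQLContext.jsonRDD`. A hedged illustration assuming a local Spark install; the sample records are invented, and their fields deliberately differ so Spark SQL must infer a schema that unions them.

```python
import json

def sample_json_lines():
    # Hypothetical records, one JSON object per line, as Spark SQL expects.
    people = [{"name": "Alice", "age": 34}, {"name": "Bob", "city": "LA"}]
    return [json.dumps(p) for p in people]

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    sc = SparkContext("local[*]", "json-demo")
    sql_ctx = SQLContext(sc)
    people = sql_ctx.jsonRDD(sc.parallelize(sample_json_lines()))  # no DDL needed
    people.printSchema()               # inferred, nullable union of all fields
    people.registerTempTable("people")
    print(sql_ctx.sql("SELECT name FROM people WHERE age > 30").collect())

if __name__ == "__main__":
    main()
```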

3. Integration: Parquet

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of Parquet and Spark SQL integration: http://www.infoobjects.com/spark-sql-parquet/
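A hedged sketch of the Parquet round trip with the Spark 1.x API (`saveAsParquetFile` / `parquetFile`); the sample rows and the `/tmp` path are placeholders, and the pure helper mirrors what the SQL query should return.

```python
# Sample rows to round-trip through Parquet (invented data).
SAMPLE_PEOPLE = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 28}]

def names_older_than(people, age):
    # Pure reference for the SQL query issued below.
    return [p["name"] for p in people if p["age"] > age]

def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    sc = SparkContext("local[*]", "parquet-demo")
    sql_ctx = SQLContext(sc)
    rows = sc.parallelize([Row(**p) for p in SAMPLE_PEOPLE])
    df = sql_ctx.inferSchema(rows)
    df.saveAsParquetFile("/tmp/people.parquet")        # columnar on-disk format
    back = sql_ctx.parquetFile("/tmp/people.parquet")  # schema is preserved
    back.registerTempTable("people")
    print(sql_ctx.sql("SELECT name FROM people WHERE age > 30").collect())

if __name__ == "__main__":
    main()
```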

3. Integration: Avro

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets.
    • Data layout can change without notice.
    • New data sets can be added without notice.
  • Result:
    • Leverage Spark to dynamically split the data.
    • Leverage Avro to store the data in a compact binary format.

3. Integration: Kite SDK

• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

3. Integration: Elasticsearch

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
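One way the RDD-to-documents path above can look from PySpark is via elasticsearch-hadoop's `EsOutputFormat` through Spark's generic Hadoop-output API. This is a hedged sketch following the elasticsearch-hadoop documentation pattern, not code from the slides: it assumes the elasticsearch-hadoop jar is on Spark's classpath, and the node address `localhost:9200` and index/type `demo/docs` are placeholders.

```python
# Documents to index (invented); elasticsearch-hadoop takes (key, dict) pairs.
SAMPLE_DOCS = [("1", {"title": "spark"}), ("2", {"title": "hadoop"})]

def main():
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "es-demo")
    sc.parallelize(SAMPLE_DOCS).saveAsNewAPIHadoopFile(
        path="-",  # ignored by EsOutputFormat, which writes to the ES cluster
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={"es.resource": "demo/docs",        # hypothetical index/type
              "es.nodes": "localhost:9200"})     # placeholder node address

if __name__ == "__main__":
    main()
```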

3. Integration: Solr

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

3. Integration: HUE

• HUE is the open-source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

4. Complementarity: Spark + Tachyon

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

4. Complementarity: YARN + Mesos

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open-source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

4. Complementarity: YARN + Mesos references

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

4. Complementarity: Spark + Tez

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when the data processed is smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the Big Data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

5. Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System

Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
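The "bring your own storage" point above can be made concrete: the same `textFile` API works across backends, with only the URI scheme changing. A hedged sketch; the scheme table is this author's summary (the 1.x-era `s3n` connector for S3), and all bucket/host names are placeholders.

```python
def storage_uri(backend, path):
    # Pure helper mapping a backend name to a URI Spark's textFile() accepts.
    schemes = {
        "local": "file://",       # plain local files, no HDFS involved
        "hdfs": "hdfs://",        # Hadoop DFS, if you do have one
        "s3": "s3n://",           # Amazon S3 (the 1.x-era s3n connector)
        "tachyon": "tachyon://",  # in-memory Tachyon
    }
    return schemes[backend] + path

def main():
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "fs-demo")
    # Identical processing code; only the storage backend changes per run.
    lines = sc.textFile(storage_uri("s3", "my-bucket/events/"))  # placeholder bucket
    print(lines.count())

if __name__ == "__main__":
    main()
```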

1. File System (continued)

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...


2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
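In practice, the deployment choice above largely reduces to the `--master` URL handed to `spark-submit`. A small pure-Python helper (hypothetical, for illustration only) makes the options concrete; the host names are placeholders, with 7077 and 5050 being the usual standalone and Mesos default ports.

```python
def spark_submit_cmd(master, app, conf=None):
    # Build a spark-submit invocation as an argv list; only --master changes
    # across local, standalone, Mesos, or YARN deployments.
    cmd = ["spark-submit", "--master", master]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "%s=%s" % (key, value)]
    cmd.append(app)
    return cmd

# The same app, four deployment targets:
examples = [
    spark_submit_cmd("local[*]", "app.py"),             # single machine
    spark_submit_cmd("spark://master:7077", "app.py"),  # standalone cluster
    spark_submit_cmd("mesos://master:5050", "app.py"),  # Mesos
    spark_submit_cmd("yarn-client", "app.py"),          # YARN (Spark 1.x syntax)
]
```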


3. Distributions

• Using Spark on a non-Hadoop distribution:

Cloud: Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE: DataStax Enterprise

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

xPatterns

• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open-source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38


4. Alternatives

Hadoop ecosystem → Spark ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

Tachyon

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

Mesos

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a data center between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs. Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

Spark Native API

• The Spark native API is available in Scala, Java, and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8 lambda expressions, making code nearly as concise and simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
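To make the native API concrete, here is the classic word count in PySpark, with a pure-Python reference implementation of the same flatMap/map/reduceByKey pipeline (the sample lines are invented; the Spark part assumes a local install).

```python
def word_count(lines):
    # Pure reference of the flatMap -> map -> reduceByKey pipeline below.
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def main():
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "wordcount")
    lines = sc.parallelize(["spark or hadoop", "spark and hadoop"])
    counts = (lines.flatMap(lambda line: line.split())  # one record per word
                   .map(lambda word: (word, 1))         # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))    # sum counts per word
    print(sorted(counts.collect()))

if __name__ == "__main__":
    main()
```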

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini-batches
Latency                       Sub-second                 Few seconds
Fault tolerance
(every record processed)      At least once              Exactly once
                              (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark


5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 52: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Spark SQL provides built in support for Hivetables

bull Import relational data from Hive tables

bull Run SQL queries over imported data

bull Easily write RDDs out to Hive tables

bull Hive 013 is supported in Spark 120

bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883

bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib

52

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/

• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrate ingestion of HDFS data into Solr from MapReduce to Spark

• Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References

• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015 at the Los Angeles Big Data Users Group.

• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity.

3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4 Use a non-HDFS file system already supported by Spark:

• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store):

• https://spark.apache.org/docs/latest/storage-openstack-swift.html

• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
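Spark's storage agnosticism comes down to resolving a path's URI scheme to a file-system implementation. A stdlib-only sketch of that dispatch (the scheme names mirror the storage options listed above, but the handler registry itself is invented for illustration):

```python
from urllib.parse import urlparse

# Illustrative registry: URI scheme -> human-readable backend name.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "swift": "OpenStack Swift",
    "tachyon": "Tachyon in-memory FS",
    "file": "Local file system",
}

def resolve_backend(path):
    """Pick a storage backend from the URI scheme, defaulting to local."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "Unknown scheme: " + scheme)

print(resolve_backend("hdfs://namenode:8020/data/events"))  # Hadoop Distributed File System
print(resolve_backend("/tmp/local.txt"))                    # Local file system
```

In Spark itself this resolution is delegated to the Hadoop FileSystem API, which is exactly why any filesystem with a Hadoop-compatible connector "just works" as an input path.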

74

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8 HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop Ecosystem | Spark Ecosystem

Component:
HDFS | Tachyon
YARN | Mesos

Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share the datacenter between multiple cluster computing apps. Provide new abstractions and services.

• Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java, and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
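The conciseness the bullets above refer to comes from chaining functional transformations with lambdas. A toy, stdlib-only pipeline in that style (the `Pipeline` class is invented for illustration; it is not the real RDD API, and it is eager rather than lazy):

```python
from functools import reduce

class Pipeline:
    """Toy RDD-like wrapper: in-memory and eager, for illustration only."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return Pipeline(f(x) for x in self.data)
    def filter(self, f):
        return Pipeline(x for x in self.data if f(x))
    def reduce(self, f):
        return reduce(f, self.data)

lines = ["spark or hadoop", "spark and hadoop"]
word_count = (Pipeline(lines)
              .map(lambda line: len(line.split()))
              .reduce(lambda a, b: a + b))
print(word_count)  # 6
```

The same shape (`map`/`filter`/`reduce` chained with lambdas) is what Java 8 lambda support brings to Spark's Java API, closing most of the verbosity gap with Scala.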

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
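The "mix and match SQL and imperative code" idea can be sketched with stdlib `sqlite3` standing in for Spark SQL (an analogy only, not Spark code): run a declarative query, then post-process the result rows with ordinary functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 120), ("b", 340), ("a", 55)])

# Declarative step: aggregate in SQL.
rows = conn.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary code over the query result.
report = {user: total for user, total in rows}
print(report)  # {'a': 175, 'b': 340}
```

In Spark SQL the two steps share one engine and one data structure (the SchemaRDD/DataFrame), so the SQL and the imperative transformations compose into a single optimized job.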

92

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
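The mini-batch model in the comparison above can be sketched stdlib-only (an illustration of the idea, not Spark Streaming's implementation): timestamped records are grouped into fixed-width time windows, and each window is then processed as one small batch.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width mini batches,
    keyed by window start time -- the Spark Streaming model, versus
    Storm's one-record-at-a-time processing."""
    batches = {}
    for ts, value in events:
        window_start = (ts // interval) * interval
        batches.setdefault(window_start, []).append(value)
    return batches

events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (3.1, "d")]
print(micro_batches(events, 1.0))
# {0.0: ['a'], 1.0: ['b', 'c'], 3.0: ['d']}
```

Batching is what costs the "few seconds" of latency in the table, and it is also what lets each mini batch reuse the core Spark API with exactly-once semantics.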

95

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1 File System: Spark is file-system agnostic. Bring your own storage.

2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 53: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Drill is intended to achieve the sub-second latency

needed for interactive data analysis and exploration httpdrillapacheorg

bull Drill and Spark Integration is work in progress in 2015 to

address new use cases

bull Use a Drill query (or view) as the input to Spark Drill

extracts and pre-processes data from various data

sources and turns it into input to Spark

bull Use Drill to query Spark RDDs Use BI tools to query

in-memory data in Spark Embed Drill execution in a

Spark data pipeline

Source Whats Coming in 2015 for

Drillhttpdrillapacheorgblog20141216whats-coming-in-2015

53

3 Integration

bull Apache Kafka is a high throughput distributed

messaging system httpkafkaapacheorg

bull Spark Streaming integrates natively with Kafka

Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml

bull Tutorial Integrating Kafka and Spark Streaming

Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-

example-tutorial

bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka

54

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria            YARN                        Mesos
Resource sharing    Yes                         Yes
Written in          Java                        C++
Scheduling          Memory only                 CPU and memory
Running tasks       Unix processes              Linux container groups
Requests            Specific requests and       More generic, but more coding
                    locality preference         for writing frameworks
Maturity            Less mature                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 lambda expressions, making code nearly as concise as the Scala API

• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
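As a flavor of the functional, chained style the native API encourages, here is a dependency-free Python sketch of a word-count pipeline. `MiniRDD` is a hypothetical, eager, single-machine stand-in for illustration only, not Spark's actual RDD class:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: eager, single-machine, illustration only."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

lines = MiniRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The real API has the same shape in Scala, Java 8, and Python: transformations compose lazily into a lineage, and an action such as `collect` triggers execution.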

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
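The "mix and match" idea, declarative SQL interleaved with imperative code over the same data, can be illustrated without Spark using Python's built-in sqlite3 module. This is an analogy for the programming pattern only, not Spark SQL's API:

```python
import sqlite3

# In-memory table standing in for a SchemaRDD/DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in plain code.
top = [(u, c) for (u, c) in rows if c > 4]
print(top)  # [('ann', 5), ('bob', 7)]
```

In Spark SQL the same back-and-forth works over distributed data: SQL results come back as RDDs/DataFrames that plug directly into the programmatic API, and vice versa.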

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may be       Exactly once
record processed)             duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python

95
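The "record at a time" vs "mini batches" distinction in the comparison above can be sketched in a few lines of dependency-free Python: the same event stream is handled per record (Storm-style) and in fixed-size micro-batches, the way Spark Streaming discretizes a stream. The function names are illustrative, not any framework's API:

```python
def record_at_a_time(stream, handle):
    # Storm-style: invoke the handler once per record as it arrives.
    for record in stream:
        handle([record])

def mini_batches(stream, handle, batch_size=3):
    # Spark Streaming-style: buffer records and hand over small batches.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:                      # flush the final partial batch
        handle(batch)

calls = []
mini_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what buys Spark Streaming its exactly-once semantics and reuse of the core batch API, at the cost of a few seconds of latency.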

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 54: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org

• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html

• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial

• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka

54
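What Kafka buys Spark Streaming is decoupling: producers append to a durable, ordered log, and consumers read from it at their own pace and offset. A dependency-free Python sketch of that log abstraction (an in-process toy, not Kafka's protocol or client API):

```python
class ToyLog:
    """In-process stand-in for a Kafka topic: an append-only, ordered log."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)      # producers only ever append

    def consume(self, offset):
        # Consumers track their own offset and re-read independently,
        # which is what lets a restarted stream job replay history.
        return self.records[offset:]

topic = ToyLog()
for event in ["click", "view", "click"]:
    topic.produce(event)

# Two independent consumers at different offsets see consistent history.
print(topic.consume(0))  # ['click', 'view', 'click']
print(topic.consume(1))  # ['view', 'click']
```

This replayability is why Kafka pairs so well with a receiver that may fail and recover, as in the Netflix chaos-monkey write-up cited elsewhere in this deck.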

3 Integration

• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org

• Spark Streaming integrates natively with Flume. There are two approaches to this:

• Approach 1: Flume-style push-based approach

• Approach 2 (experimental): pull-based approach using a custom sink

• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html

55

3 Integration

• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.

• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.

• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

56
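Schema inference over JSON boils down to walking the records and unioning the types observed per field. A minimal, dependency-free Python sketch of that idea (an illustration of the concept, not Spark SQL's actual inference algorithm, which also merges nested types):

```python
import json

def infer_schema(json_lines):
    """Union the Python types observed for each top-level field."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 36.5, "city": "LA"}',
]
print(infer_schema(records))
# {'name': ['str'], 'age': ['float', 'int'], 'city': ['str']}
```

Note how fields missing from some records ("city") and mixed numeric types ("age") are still accounted for; Spark SQL resolves such conflicts by widening to a common type.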

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

• Import relational data from Parquet files

• Run SQL queries over imported data

• Easily write RDDs out to Parquet files: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57
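The point of a columnar format like Parquet is that all values of one column are stored contiguously, so a query touching one column reads only that column. A dependency-free Python sketch of the row-to-columnar pivot (illustration of the layout idea only, not the Parquet file format, which adds encoding, compression, and metadata):

```python
rows = [
    {"user": "ann", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
]

# Pivot row-oriented records into a column-oriented layout.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A scan over one column now touches a single contiguous list,
# instead of every field of every row.
print(columns["clicks"])       # [3, 7]
print(sum(columns["clicks"]))  # 10
```

Contiguous same-typed values are also what make Parquet's per-column compression and predicate pushdown effective.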

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

• Various inbound data sets

• Data layout can change without notice

• New data sets can be added without notice

• Result:

• Leverage Spark to dynamically split the data

• Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.

• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:

• Migrates ingestion of HDFS data into Solr from MapReduce to Spark

• Updates and deletes existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and components of the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

[Figure: Hadoop ecosystem and Spark ecosystem components]

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69
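The Data << RAM heuristic is essentially: pay the parse cost once and keep the parsed form in memory while it fits. A dependency-free Python sketch of that trade-off (illustrative names and a crude item-count budget, not any Spark API or memory model):

```python
class CachingLoader:
    """Parse a dataset once and reuse it while it fits a memory budget."""
    def __init__(self, budget_items):
        self.budget_items = budget_items
        self.cache = {}
        self.parse_calls = 0

    def _parse(self, raw):
        self.parse_calls += 1
        return [int(x) for x in raw.split(",")]

    def load(self, name, raw):
        parsed = self._parse(raw)
        if len(parsed) <= self.budget_items:   # "data << RAM": keep it
            self.cache[name] = parsed
        return parsed

    def get(self, name, raw):
        return self.cache.get(name) or self.load(name, raw)

loader = CachingLoader(budget_items=100)
loader.get("small", "1,2,3")   # parsed and cached
loader.get("small", "1,2,3")   # served from cache, no re-parse
print(loader.parse_calls)      # 1
```

When the data exceeds the budget, every access re-parses, which is the regime where a more stream-oriented engine such as Tez can win.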

4 Complementarity

• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine, an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73


Page 55: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Apache Flume is a streaming event data

ingestion system that is designed for Big Data

ecosystem httpflumeapacheorg

bull Spark Streaming integrates natively with

Flume There are two approaches to this

bull Approach 1 Flume-style Push-based Approach

bull Approach 2 (Experimental) Pull-based

Approach using a Custom Sink

bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml

55

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
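The "translated into documents" requirement is easy to picture: each record becomes an action/document pair in the Elasticsearch bulk API. The sketch below shows that translation in plain Python; the real elasticsearch-hadoop connector also handles partitioning, batching, and retries, and the function name here is illustrative:

```python
def to_bulk_actions(records, index, doc_type):
    """Translate plain records into the alternating metadata/document pairs
    of the Elasticsearch bulk API -- in spirit, what elasticsearch-hadoop
    does when an RDD is saved to Elasticsearch. (Illustrative sketch.)"""
    actions = []
    for record in records:
        actions.append({"index": {"_index": index, "_type": doc_type}})
        actions.append(record)
    return actions

docs = [{"msg": "hello"}, {"msg": "world"}]
print(to_bulk_actions(docs, "logs", "event"))  # alternating action/document pairs
```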

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

[Diagram: Hadoop ecosystem and Spark ecosystem components]

4 Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity

References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4 Complementarity

• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
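In code, this file-system agnosticism shows up as nothing more than the URI scheme of the path you hand to Spark: the same `textFile(path)` call works against any backend. The mapping below is an illustrative toy, not Spark code:

```python
from urllib.parse import urlparse

# Spark addresses storage through Hadoop-style URIs, so the scheme of the
# path selects the backend. (Illustrative mapping; scheme names like "s3n"
# reflect the Spark 1.x era.)
BACKENDS = {"hdfs": "HDFS", "s3n": "Amazon S3", "tachyon": "Tachyon",
            "swift": "OpenStack Swift", "file": "local file system"}

def storage_backend(path):
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "other")

print(storage_backend("s3n://my-bucket/logs/2015/x.gz"))  # Amazon S3
print(storage_backend("/data/events.txt"))                # local file system
```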

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
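In practice, switching between these deployments mostly comes down to the master URL passed to Spark. The classifier below sketches that dispatch in plain Python, assuming the Spark 1.x master-URL conventions (`local[n]`, `spark://`, `mesos://`, `yarn-client`/`yarn-cluster`):

```python
def cluster_manager(master):
    """Classify a Spark master URL by the cluster manager it implies; this
    one configuration setting is essentially all that changes between the
    deployments listed above. (Simplified sketch, not Spark code.)"""
    if master.startswith("local"):
        return "local threads, no cluster"
    if master.startswith("spark://"):
        return "standalone cluster"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master.startswith("yarn"):
        return "Hadoop YARN"
    raise ValueError("unrecognized master URL: %s" % master)

print(cluster_manager("local[4]"))                      # local threads, no cluster
print(cluster_manager("mesos://zk://host:2181/mesos"))  # Apache Mesos
```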

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos; September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and Memory
Running tasks     Unix processes               Linux Container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
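The flavor of the native API is the classic word count: flatMap the lines into words, map each to a pair, and reduceByKey. The plain-Python stand-in below mimics that pipeline locally, with no Spark required, just to show the shape of the computation:

```python
from collections import Counter

def word_count(lines):
    """Local stand-in for the canonical Spark word count
    (flatMap -> map -> reduceByKey), written in plain Python to show the
    shape of the native API. (Sketch only; not distributed.)"""
    words = (word for line in lines for word in line.split())
    return dict(Counter(words))

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```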

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
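That "mix and match" style - a declarative SQL step followed by imperative post-processing in the same program - is sketched below using Python's stdlib sqlite3 purely for illustration; in Spark SQL the table would be a SchemaRDD/DataFrame and the query would run distributed:

```python
import sqlite3

# Illustration of mixing declarative SQL with imperative code in one
# program (the style Spark SQL enables); sqlite3 is a stand-in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")])

rows = conn.execute("SELECT level, COUNT(*) FROM logs "
                    "GROUP BY level ORDER BY level").fetchall()
summary = {level: count for level, count in rows}  # imperative step on SQL output
print(summary)  # {'ERROR': 2, 'INFO': 1}
```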

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

95
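The "mini batch" row in the table above is the key architectural difference, and it is easy to picture: incoming events are grouped into fixed-width time windows before processing, instead of being handled one record at a time as in Storm. A toy single-process sketch:

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time windows: the
    mini-batch model of Spark Streaming, in contrast to Storm's
    record-at-a-time processing. (Toy sketch, single process.)"""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[key] for key in sorted(batches)]

events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (2.5, "d")]
print(micro_batches(events, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

Batching is what gives Spark Streaming its exactly-once semantics and Core Spark API integration, at the cost of a few seconds of latency.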

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 56: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Spark SQL provides built in support for JSON that

is vastly simplifying the end-to-end-experience of

working with JSON data

bull Spark SQL can automatically infer the schema

of a JSON dataset and load it as a

SchemaRDD No more DDL Just point Spark

SQL to JSON files and query Starting Spark 13

SchemaRDD will be renamed to DataFrame

bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-

support-in-spark-sqlhtml

56

3 Integration

bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg

bull Built in support in Spark SQL allows to

bull Import relational data from Parquet files

bull Run SQL queries over imported data

bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files

bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet

57

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

             Hadoop Ecosystem    Spark Ecosystem
Component    HDFS                Tachyon
             YARN                Mesos
Tools        Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88
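The memory-centric idea behind Tachyon can be illustrated with a toy read-through cache: repeated reads of the same block are served from RAM instead of the slow backing store. This is a conceptual sketch only (class and names are invented, not the Tachyon API):

```python
class MemoryCachedStore:
    """Toy read-through cache: repeated reads are served from RAM instead of
    the slow backing store, which is the core idea behind Tachyon's
    memory-centric file sharing (illustrative sketch, not the Tachyon API)."""

    def __init__(self, backing):
        self.backing = backing   # dict standing in for disk-based storage (e.g. HDFS)
        self.cache = {}          # in-memory copy, shared by later readers
        self.slow_reads = 0      # how often we had to touch the backing store

    def read(self, key):
        if key not in self.cache:
            self.slow_reads += 1               # would be a disk/network read in reality
            self.cache[key] = self.backing[key]
        return self.cache[key]

store = MemoryCachedStore({"block-1": b"events"})
first = store.read("block-1")    # hits the backing store
second = store.read("block-1")   # served at memory speed
print(store.slow_reads)          # only one slow read for two accesses
```

In the real system the cache is a cluster-wide file system, so a block cached by one framework (say, a MapReduce job) can be re-read at memory speed by another (a Spark job).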

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria           YARN                       Mesos
Resource sharing   Yes                        Yes
Written in         Java                       C++
Scheduling         Memory only                CPU and Memory
Running tasks      Unix processes             Linux Container groups
Requests           Specific requests and      More generic, but more coding
                   locality preference        for writing frameworks
Maturity           Less mature                Relatively more mature

90

Spark Native API

• Spark Native API in Scala, Java and Python

• Interactive shell in Scala and Python

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
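The functional, lambda-heavy style of the native API can be shown without a cluster. The helpers below are plain-Python stand-ins for three core RDD transformations (they are illustrative only, not the real pyspark or Scala API), chained in the classic word-count shape:

```python
# Plain-Python stand-ins for three core RDD transformations, to show the
# functional style of the Spark native API (NOT the real pyspark API).
def flat_map(f, data):
    return [y for x in data for y in f(x)]

def map_each(f, data):
    return [f(x) for x in data]

def reduce_by_key(f, pairs):
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

lines = ["spark or hadoop", "spark and hadoop"]
words  = flat_map(lambda line: line.split(), lines)   # flatMap
pairs  = map_each(lambda w: (w, 1), words)            # map
counts = reduce_by_key(lambda a, b: a + b, pairs)     # reduceByKey
print(counts)   # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In real Spark the same three lambdas are passed to `flatMap`, `map` and `reduceByKey` on an RDD, which is why the Java 8 lambda support mentioned above brings the Java code so close to the Scala version.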

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
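The "mix and match SQL with imperative code" idea can be sketched without a Spark cluster; here plain Python and the stdlib sqlite3 module stand in for Spark SQL (the table and column names are invented for the example):

```python
import sqlite3

# Imperative side: build structured records in ordinary code.
events = [("click", 3), ("view", 10), ("click", 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

# Declarative side: query the same data with SQL.
rows = conn.execute(
    "SELECT kind, SUM(n) FROM events GROUP BY kind ORDER BY kind"
).fetchall()

# Back to imperative code over the query result.
totals = {kind: total for kind, total in rows}
print(totals)   # {'click': 10, 'view': 10}
```

Spark SQL applies the same pattern at cluster scale: a DataFrame built programmatically (or loaded from JSON/Parquet/Hive) can be queried with SQL, and the result flows straight back into RDD-style transformations.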

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

95
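The "mini batches" row of the comparison is easy to picture: Spark Streaming chops the incoming record stream into small batches and runs ordinary batch code on each one. This plain-Python sketch shows the idea (it is conceptual, not the DStream API):

```python
from itertools import islice

def mini_batches(records, batch_size):
    """Chop a (potentially unbounded) record stream into small batches,
    the DStream idea behind Spark Streaming (conceptual sketch, not the API)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def word_count(batch):
    # The same function a plain batch job would use; reusing batch code on
    # each mini-batch is the "Core Spark API" integration advantage.
    counts = {}
    for w in batch:
        counts[w] = counts.get(w, 0) + 1
    return counts

stream = ["a", "b", "a", "c", "a", "b"]
per_batch = [word_count(b) for b in mini_batches(stream, 3)]
print(per_batch)   # [{'a': 2, 'b': 1}, {'c': 1, 'a': 1, 'b': 1}]
```

Real Spark Streaming cuts batches by time interval rather than record count, which is why its latency is "a few seconds" where Storm's record-at-a-time model is sub-second.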

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is File System Agnostic. Bring Your Own Storage!

2. Deployment: Spark is Cluster Infrastructure Agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 57: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org

• Built-in support in Spark SQL allows you to:

  • Import relational data from Parquet files

  • Run SQL queries over imported data

  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

• This is an illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet

57

3 Integration

• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro

• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro

• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015

• Problem:

  • Various inbound data sets

  • Data layout can change without notice

  • New data sets can be added without notice

• Result:

  • Leverage Spark to dynamically split the data

  • Leverage Avro to store the data in a compact binary format

58

3 Integration: Kite SDK

• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current

• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.

• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark

59

3 Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60

3 Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:

  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark

  • Update and delete existing documents in Solr at scale

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3 Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4 Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4 Complementarity +

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4 Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4 Complementarity +

• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", with a more mature shuffling implementation and closer YARN integration.

• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.

• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a Non-HDFS file system already supported by Spark:

  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html

  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):

  • https://spark.apache.org/docs/latest/storage-openstack-swift.html

  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
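In practice, Spark's storage agnosticism largely comes down to the URI scheme handed to the input API: the driver code that processes the data stays the same, and only the URI changes. The sketch below builds such URIs; the host names, port numbers and bucket name are hypothetical, while the schemes are the commonly used ones:

```python
def storage_uri(scheme: str, path: str) -> str:
    """Build the kind of URI a Spark driver passes to sc.textFile().
    Only the scheme/authority changes per storage backend; the code that
    consumes the data stays identical. Hosts and bucket are made up."""
    authority = {
        "hdfs": "namenode:8020",      # Hadoop Distributed File System
        "tachyon": "master:19998",    # Tachyon in-memory file system
        "s3n": "my-bucket",           # Amazon S3 (bucket name hypothetical)
        "file": "",                   # local file system
    }[scheme]
    return f"{scheme}://{authority}{path}"

# In a real job the same driver line works for any backend, e.g.:
#   sc.textFile(storage_uri("s3n", "/logs/events.log")).count()
print(storage_uri("hdfs", "/logs/events.log"))  # hdfs://namenode:8020/logs/events.log
```

Swapping HDFS for S3, Tachyon or a local path is therefore a configuration change, not a code change, which is what makes the "Bring Your Own Storage" takeaway later in this section possible.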

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77


Page 58: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro

bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro

bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015

bull Problem

bull Various inbound data sets

bull Data Layout can change without notice

bull New data sets can be added without notice

Result

bull Leverage Spark to dynamically split the data

bull Leverage Avro to store the data in a compact binary format

58

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69
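The Data << RAM point can be made concrete with a toy sketch in plain Python (an illustration only, not Spark itself): re-parsing raw input on every pass versus keeping the parsed records in memory, which is what caching an RDD achieves across iterations.

```python
# Toy illustration (plain Python, not Spark) of why caching parsed data
# in memory pays off for iterative jobs when the data fits in RAM.
parse_calls = 0

def parse(line):
    """Stand-in for an expensive parse/deserialization step."""
    global parse_calls
    parse_calls += 1
    return int(line)

lines = ["1", "2", "3", "4"]  # stand-in for raw input records

# Without caching: every iteration re-parses the raw input,
# like re-reading and re-deserializing from disk on each pass.
for _ in range(3):
    total = sum(parse(l) for l in lines)
calls_without_cache = parse_calls

# With caching: parse once, keep the parsed records in memory.
parse_calls = 0
cached = [parse(l) for l in lines]
for _ in range(3):
    total = sum(cached)
calls_with_cache = parse_calls

print(calls_without_cache, calls_with_cache)  # 12 4
```

Three passes over four records cost twelve parses without the cache and only the initial four with it; with real data and real deserialization costs, that gap is where Spark's in-memory advantage comes from.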

4 Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine — interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74

1 File System

Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/

• Quantcast QFS: https://www.quantcast.com/engineering/qfs/

• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC Clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
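In practice, the deployment choice surfaces as the --master URL passed to spark-submit: the same application runs on different cluster managers by changing only that URL. A sketch with placeholder host names and a hypothetical app.py (the URL schemes are Spark's; the hosts and file name are illustrative):

```shell
# Same application, different cluster managers — only --master changes.
# Host names and app.py below are placeholders, not values from the talk.

# 1. Local mode: all cores of a single machine, no cluster needed
spark-submit --master "local[*]" app.py

# 2. Spark standalone cluster manager
spark-submit --master spark://master-host:7077 app.py

# 3. Apache Mesos cluster manager
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN: cluster location is read from the Hadoop configuration
spark-submit --master yarn app.py
```

This is a configuration sketch rather than a runnable script; the EC2/EMR, Rackspace, and Google Cloud options above ultimately launch one of these same cluster managers on provisioned machines.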

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

Hadoop ecosystem → Spark ecosystem

Components:
• HDFS → Tachyon
• YARN → Mesos

Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share a datacenter between multiple cluster computing apps, and provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding to write frameworks
Maturity         | Less mature                               | Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shells in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
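To give a flavor of the API's functional style, here is a word count written in plain Python the way Spark's Python API expresses it, with comments naming the corresponding RDD operations (flatMap, map, reduceByKey). This is a local illustration, not PySpark itself:

```python
from collections import defaultdict

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In real Spark code the same three steps chain on an RDD, and the Scala, Java 8, and Python versions all read essentially like this pipeline.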

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
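To make "ingesting data from sources that provide schema" concrete, here is a toy sketch in plain Python of the kind of schema inference Spark SQL performs when reading JSON records: union the fields seen across records and note each field's type. This is a simplified illustration, not Spark SQL's actual algorithm:

```python
import json

# Toy schema inference over JSON records: collect every field seen
# across records and the Python type of its first value (simplified —
# Spark SQL additionally merges and widens conflicting types).
records = [
    '{"name": "spark", "year": 2014}',
    '{"name": "hadoop", "year": 2006, "lang": "java"}',
]

schema = {}
for raw in records:
    for field, value in json.loads(raw).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'year': 'int', 'lang': 'str'}
```

Having a schema like this is what lets the same data be queried with SQL and manipulated with the programmatic API interchangeably.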

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python

95
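The "mini batches" processing model in the table above can be sketched in plain Python: incoming records are grouped into batches and each batch is processed as one small job. A toy illustration only — Spark Streaming's real batches are defined by a time interval, not a record count:

```python
# Toy micro-batching: group a stream of records into fixed-size batches
# and run one small aggregation job per batch, the way Spark Streaming
# processes one mini batch per batch interval.
stream = [3, 1, 4, 1, 5, 9, 2, 6]
batch_size = 3

batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
results = [sum(batch) for batch in batches]  # one aggregate per micro-batch

print(batches)  # [[3, 1, 4], [1, 5, 9], [2, 6]]
print(results)  # [8, 15, 8]
```

Batching is what gives Spark Streaming its few-seconds latency but also its exactly-once semantics and reuse of the core batch API, versus Storm's record-at-a-time model.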

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 59: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration Kite SDK

bull The Kite SDK provides high level abstractions to

work with datasets on Hadoop hiding many of

the details of compression codecs file formats

partitioning strategies etchttpkitesdkorgdocscurrent

bull Spark support has been added to Kite 016

release so Spark jobs can read and write to Kite

datasets

bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark

59

3 Integration

bull Elasticsearch is a real-time distributed search and analytics

engine httpwwwelasticsearchorg

bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml

bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark

bull elasticsearch-hadoop provides native integration between

Elasticsearch and Apache Spark in the form of RDD that can

read data from Elasticsearch Also any RDD can be saved to

Elasticsearch as long as its content can be translated into

documents httpsgithubcomelasticelasticsearch-hadoop

bull Great use case by NTT Data integrating Apache

Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html

60

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 60: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3. Integration

• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org

• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark

• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop

• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html

60
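The elasticsearch-hadoop round trip described above can be sketched as follows. This assumes a SparkContext configured with `es.nodes` pointing at an Elasticsearch cluster; the index/type names are illustrative only:

```scala
import org.apache.spark.SparkContext
import org.elasticsearch.spark._ // adds saveToEs / esRDD to Spark types

// Hedged sketch: requires the elasticsearch-hadoop jar on the classpath
// and a reachable Elasticsearch cluster.
def esRoundTrip(sc: SparkContext): Unit = {
  // Save any RDD whose elements translate to documents (e.g. Maps).
  val docs = sc.makeRDD(Seq(Map("title" -> "spark"), Map("title" -> "hadoop")))
  docs.saveToEs("talks/slides") // "index/type" target, placeholder names
  // Read an index back as an RDD of (documentId, document) pairs.
  val rdd = sc.esRDD("talks/slides")
  println(rdd.count())
}
```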

3. Integration

• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".

• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.

• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed

61

3. Integration

• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com

• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.

• Demo of Spark Igniter: http://vimeo.com/83192197

• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

Hadoop ecosystem | Spark ecosystem

4. Complementarity

• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html

65

4. Complementarity

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.

• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.

• Project Myriad is an open source framework for running YARN on Mesos.

• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity

References:

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E

• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad

• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management

• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn

• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).

• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.

• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).

• Tez supports enterprise security.

68

4. Complementarity

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.

• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.

• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration

• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.

• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).

• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1. File System

Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
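As a concrete illustration of point 4, reading straight from Amazon S3 needs no HDFS at all. A hedged sketch for Spark 1.x with the bundled Hadoop s3n:// connector; the bucket name, path, and credential environment variables are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes AWS credentials are exported in the environment
// and that the bucket/path below are replaced with real ones.
val sc = new SparkContext(new SparkConf().setAppName("S3WithoutHDFS"))
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// textFile accepts any Hadoop-compatible URI scheme, s3n:// included.
val logs = sc.textFile("s3n://my-bucket/logs/*.gz") // placeholder bucket
println(logs.count())
```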

1. File System

File-system agnostic Spark, coupled with its analytics capabilities, can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• ...

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
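In practice, switching between these deployments is just a different master URL; the application code is unchanged. A hedged sketch for Spark 1.x, where the cluster host names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pick the master URL from an (assumed) DEPLOY environment variable.
// All host:port values below are placeholders for real cluster addresses.
val master = sys.env.getOrElse("DEPLOY", "local") match {
  case "local"      => "local[*]"                  // single machine, all cores
  case "standalone" => "spark://master-host:7077"  // Spark standalone cluster
  case "mesos"      => "mesos://mesos-host:5050"   // Apache Mesos
  case "yarn"       => "yarn-client"               // Hadoop YARN (Spark 1.x syntax)
  case other        => other
}
val sc = new SparkContext(new SparkConf().setAppName("AnyCluster").setMaster(master))
```

The same choice can also be made at launch time with spark-submit's --master flag instead of hard-coding it in the application.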

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE, DataStax Enterprise built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com

• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

88
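Because Tachyon is Hadoop compatible, swapping it in for HDFS is just a change of URI scheme. A hedged sketch assuming a running Tachyon master (19998 is Tachyon's default master port; host and paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pointing textFile/saveAsTextFile at a tachyon:// URI is the
// only change needed relative to an hdfs:// job.
val sc = new SparkContext(new SparkConf().setAppName("TachyonIO"))
val data = sc.textFile("tachyon://tachyon-master:19998/input")   // placeholder path
data.map(_.toUpperCase)
    .saveAsTextFile("tachyon://tachyon-master:19998/output")     // placeholder path
```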

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps. Provide new abstractions and services.
  • Mesosphere DCOS: Datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as with the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
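A small ETL pass in the native Scala API gives a feel for how concise it is. This is a hedged sketch: the input path, the CSV layout (a status value in the third column), and the cluster setup are all assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: placeholder path and an assumed CSV layout of
// id,timestamp,status per line.
val sc = new SparkContext(new SparkConf().setAppName("NativeApiEtl"))
val rows = sc.textFile("hdfs:///raw/events.csv")
  .map(_.split(","))
  .filter(_.length >= 3)            // drop malformed lines
  .map(cols => (cols(0), cols(2)))  // keep (id, status)
  .cache()                          // reuse without re-reading the file
println(rows.filter(_._2 == "ERROR").count())
```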

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
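Mixing schema-carrying input (here JSON) with SQL, as described above, looks like this in the Spark 1.x-era SQLContext API. A hedged sketch: the JSON path is a placeholder, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.sql.SQLContext

// Sketch for Spark 1.x; `sc` is an assumed, already-created SparkContext.
val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("hdfs:///data/people.json") // schema inferred from JSON
people.registerTempTable("people")                           // expose to SQL
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.collect().foreach(println)
```

The result of the SQL query is itself a distributed dataset, so further transformations can continue in the programmatic API.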

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 61: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Integration

bull Apache Solr added a Spark-based indexing tool for

fast and easy indexing ingestion and serving

searchable complex data ldquoCrunchIndexerTool on

Sparkrdquo

bull Solr-on-Spark solution using Apache Solr Spark

Crunch and Morphlines

bull Migrate ingestion of HDFS data into Solr from

MapReduce to Spark

bull Update and delete existing documents in Solr at scale

bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-

intosolrusingsparktrimmed

61

3 Integration

bull HUE is the open source Apache Hadoop Web UI

that lets users use Hadoop directly from their

browser and be productive httpwwwgethuecom

bull A Hue application for Apache Spark called Spark

Igniter lets users execute and monitor Spark jobs

directly from their browser and be more

productive

bull Demo of Spark Igniter httpvimeocom83192197

bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-

hadoop-by-enrico-berti-at-big-data-spain-2014

62

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs. Mesos

Criteria          YARN                                     Mesos
Resource sharing  Yes                                      Yes
Written in        Java                                     C++
Scheduling        Memory only                              CPU and memory
Running tasks     Unix processes                           Linux container groups
Requests          Specific requests, locality preference   More generic, but more coding to write frameworks
Maturity          Less mature                              Relatively more mature

Spark Native API

• Spark native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, which make the Java code nearly as concise as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
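The flavor of that native API (chained, lazy transformations ending in an action) can be imitated in plain Python. This ToyRDD class is not Spark; it is a minimal stand-in showing the map/filter/collect style the slide refers to:

```python
# Toy imitation of the Spark RDD programming style in plain Python (not
# Spark itself): transformations are lazy generators, and an "action"
# (collect) is what actually forces the pipeline to run.

class ToyRDD:
    def __init__(self, data):
        self._data = data          # any iterable; nothing computed yet

    def map(self, fn):
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    def collect(self):             # action: materialize the pipeline
        return list(self._data)

lines = ["spark", "hadoop", "mesos", "yarn"]
result = (ToyRDD(lines)
          .filter(lambda s: len(s) > 4)
          .map(str.upper)
          .collect())
print(result)   # ['SPARK', 'HADOOP', 'MESOS']
```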

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
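Spark SQL itself needs a Spark cluster, so as a stand-in here is a plain-Python sqlite3 toy, explicitly an analogy rather than Spark code, showing the "mix and match" pattern the slide describes: a declarative SQL step followed by arbitrary imperative post-processing of the result. Table and column names are invented for the example:

```python
# Toy analogy (sqlite3, NOT Spark SQL) for mixing declarative SQL with
# imperative code, the pattern Spark SQL enables over its data sources.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 100), ("bob", 250), ("alice", 50)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(bytes) AS total FROM events "
    "GROUP BY user ORDER BY total DESC"
).fetchall()

# Imperative step: arbitrary code over the query result.
report = {user: f"{total / 1024:.2f} KiB" for user, total in rows}
print(report)   # {'bob': '0.24 KiB', 'alice': '0.15 KiB'}
```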

Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          A few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
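The "record at a time" vs. "mini batches" distinction in the table above can be sketched in plain Python. This is a toy illustration of the two processing models, not Storm or Spark Streaming code; the function names and the 2-second window are invented for the example:

```python
# Toy contrast of the two streaming models: record-at-a-time (Storm-like)
# handles each event on arrival; mini-batching (Spark-Streaming-like)
# groups events into fixed time windows and processes each window at once.

def record_at_a_time(events, handle):
    # Each (timestamp, value) record is handled immediately: low latency.
    for ts, value in events:
        handle(ts, value)

def mini_batches(events, batch_seconds):
    # Records are grouped into fixed intervals: latency of up to
    # batch_seconds, but each batch can reuse the batch engine.
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // batch_seconds, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.5, "a"), (1.2, "b"), (2.7, "c"), (3.1, "d")]

seen = []
record_at_a_time(events, lambda ts, v: seen.append(v))
print(seen)                      # ['a', 'b', 'c', 'd'] -- one at a time

print(mini_batches(events, 2))   # [['a', 'b'], ['c', 'd']] -- 2s windows
```

The batching step is also where Spark Streaming gets its "Core Spark API" integration: each window is processed like a small batch job.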

GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi


Page 63: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

III Spark with Hadoop

1 Evolution

2 Transition

3 Integration

4 Complementarity

5 Key Takeaways

63

4 Complementarity

Components of Hadoop ecosystem and Spark ecosystem

can work together each for what it is especially good at

rather than choosing one of them

64

Hadoop ecosystem Spark ecosystem

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and Memory
Running tasks     Unix processes              Linux Container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

90
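The "Scheduling" row above can be made concrete with a toy model: a YARN-style check (circa 2015) admits a task based on memory alone, while a Mesos-style check considers both CPU and memory. This is a hypothetical sketch in plain Python, not actual YARN or Mesos code.

```python
# Toy admission checks contrasting memory-only scheduling (YARN-style,
# as of 2015) with CPU-and-memory scheduling (Mesos-style).
# Hypothetical model only -- no real resource-manager API is used.

def fits_memory_only(node_free, task_req):
    """YARN-style check: only memory is considered."""
    return task_req["mem_mb"] <= node_free["mem_mb"]

def fits_cpu_and_memory(node_free, task_req):
    """Mesos-style check: both CPU shares and memory must fit."""
    return (task_req["mem_mb"] <= node_free["mem_mb"]
            and task_req["cpus"] <= node_free["cpus"])

node = {"mem_mb": 4096, "cpus": 2}
task = {"mem_mb": 1024, "cpus": 4}   # memory fits, CPU does not

print(fits_memory_only(node, task))     # True: admitted, CPU may be oversubscribed
print(fits_cpu_and_memory(node, task))  # False: rejected until CPUs free up
```

The point of the contrast: a memory-only scheduler can oversubscribe CPU, while a two-resource scheduler rejects the same task until CPU shares free up.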

Spark Native API

• Spark Native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
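The concise, functional style the native API encourages (flatMap, map, reduceByKey) can be mimicked on a plain Python list. This is a local stand-in only — no SparkContext or cluster is involved; in real Spark each step would be an RDD transformation.

```python
# Local sketch of the RDD-style word-count chain that the Spark native
# API exposes. Plain Python collections stand in for RDDs.

lines = ["spark or hadoop", "spark and hadoop", "hadoop"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 3, 'and': 1}
```

The same shape in Spark's Scala API would be `lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)` — which is why the slide stresses that Java 8 lambdas bring Java close to this level of concision.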

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
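The "mix and match SQL and imperative APIs" workflow can be sketched with the standard-library `sqlite3` module as a local stand-in — Spark SQL itself would use a SparkSession and DataFrames, so this only illustrates the shape of the workflow, not the Spark API.

```python
# Sketch of mixing a declarative SQL step with imperative post-processing,
# using stdlib sqlite3 as a stand-in for Spark SQL (illustrative only).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result set in ordinary code
top = [user for user, total in rows if total > 4]
print(rows)  # [('ann', 5), ('bob', 7)]
print(top)   # ['ann', 'bob']
```

In Spark SQL the analogous flow is a `spark.sql(...)` query whose resulting DataFrame is then transformed with the programmatic API — one engine serving both styles.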

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                       Storm                    Spark Streaming
Processing model               Record at a time         Mini batches
Latency                        Sub-second               Few seconds
Fault tolerance (every         At least once (may be    Exactly once
record processed)              duplicates)
Batch framework integration    Not available            Core Spark API
Supported languages            Any programming          Scala, Java, Python
                               language

95
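The two processing models in the comparison can be contrasted with a toy sketch: record-at-a-time (Storm-style) handles each event as it arrives, while mini-batching (Spark Streaming-style) groups events into small batches before processing. Pure Python illustration — neither framework is actually involved.

```python
# Toy contrast: record-at-a-time vs mini-batch stream processing.

stream = [1, 2, 3, 4, 5, 6, 7]

# Record at a time: one processing call per event (lowest latency)
record_results = [x * 10 for x in stream]

# Mini batches: group events first, then process each group
# (higher throughput, but a few seconds of added latency in the
# real systems, as the table notes)
def mini_batches(events, size):
    return [events[i:i + size] for i in range(0, len(events), size)]

batch_results = [sum(batch) for batch in mini_batches(stream, 3)]

print(record_results)  # [10, 20, 30, 40, 50, 60, 70]
print(batch_results)   # [6, 15, 7]  -- batches [1,2,3], [4,5,6], [7]
```

Batching also makes exactly-once semantics easier: a whole batch can be retried and its result replaced atomically, whereas per-record acking naturally yields at-least-once delivery.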

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi


Page 65: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4 Complementarity + +

bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS

bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark

bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml

65

4 Complementarity +

bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment

bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo

bull Project Myriad is an open source framework for running YARN on Mesos

bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41

66

4 Complementarity +

References

bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E

bull Myriad A Mesos framework for scaling a YARN

cluster httpsgithubcommesosmyriad

bull Myriad Project Marries YARN and Apache

Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-

resource-management

bull YARN vs MESOS Canrsquot We All Just Get

Along httpstrataconfcombig-data-conference-ca-

2015publicscheduledetail40620

67

4 Complementarity +

bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn

bull Tez could takes care of the pure Hadoop optimization

strategies (building the DAG with knowledge of data

distribution statistics orhellip HDFS caching)

bull Spark execution layer could be leveraged without the

need of a nasty SparkHadoop coupling

bull Tez is good on fine-grained resource isolation with

YARN (resource chaining in clusters)

bull Tez supports enterprise security

68

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86

4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs. Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose lambda expressions make code much more concise – nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
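The chained, functional style these APIs encourage can be sketched in plain stdlib Python. This is a hypothetical illustration, not PySpark; a real Spark job would express the same word-count pipeline as flatMap/map/reduceByKey on an RDD:

```python
from functools import reduce

# Mini word count in the chained, functional style the Spark native API
# (Scala/Java/Python) encourages. Plain Python stand-ins for RDD operations.
lines = ["spark or hadoop", "spark with hadoop"]

words = [w for line in lines for w in line.split()]  # flatMap: split lines into words
pairs = [(w, 1) for w in words]                      # map: pair each word with a count of 1
counts = reduce(                                     # reduceByKey: sum counts per word
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'with': 1}
```

The same three-step shape (flatten, pair, aggregate by key) carries over directly to the Scala, Java 8 lambda, and Python Spark APIs.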

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
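The "mix and match SQL and imperative code" idea can be illustrated with a small sketch. Note the assumption: sqlite3 is used here only as a stand-in engine so the example is self-contained; Spark SQL applies the same pattern to DataFrames and Hive tables at cluster scale:

```python
import sqlite3

# Declarative + imperative mix, as described on the slide, with sqlite3
# standing in for the SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code.
top_users = [user for user, total in rows if total >= 5]
print(top_users)  # ['ann', 'bob']
```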

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                     Storm                   Spark Streaming
Processing model             Record at a time        Mini batches
Latency                      Sub-second              Few seconds
Fault tolerance (every       At least once (may      Exactly once
record processed)            be duplicates)
Batch framework integration  Not available           Core Spark API
Supported languages          Any programming         Scala, Java, Python
                             language

95
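The first row of the table – record-at-a-time versus mini-batches – is the key architectural difference, and it can be sketched in a few lines of plain Python (no Storm or Spark dependency; the batch interval is simulated by grouping a list):

```python
# Record-at-a-time (Storm-style): each record is handled as soon as it
# arrives, giving sub-second latency.
stream = [1, 2, 3, 4, 5, 6, 7]
processed_one_by_one = [x * 10 for x in stream]

# Mini-batches (Spark Streaming-style): records are buffered into small
# batches and each batch is processed as a unit, trading a few seconds of
# latency for throughput and the core Spark batch API.
batch_size = 3
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
processed_in_batches = [sum(batch) for batch in batches]

print(batches)               # [[1, 2, 3], [4, 5, 6], [7]]
print(processed_in_batches)  # [6, 15, 7]
```

Because each mini-batch is an ordinary Spark job, Spark Streaming also inherits batch-framework integration and exactly-once semantics, per the table above.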

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 66: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4. Complementarity +

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41

66

4. Complementarity +

References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620

67

4. Complementarity +

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

68

4. Complementarity +

• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU

69

4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine – interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

70

4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5. Key Takeaways

1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73

1. File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
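Storage-agnosticism in practice means the backend is chosen from the URI scheme of the input path, so moving between the systems listed above is mostly a change of path prefix. A minimal sketch of that dispatch, assuming a hypothetical set of scheme names drawn from the examples on this slide:

```python
from urllib.parse import urlparse

# Hypothetical scheme table illustrating how a storage backend is picked
# from the path's URI scheme; the names mirror the slide's examples.
KNOWN_SCHEMES = {"hdfs", "s3n", "tachyon", "maprfs", "swift", "file"}

def storage_backend(path: str) -> str:
    # A bare path with no scheme falls back to the local file system.
    scheme = urlparse(path).scheme or "file"
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme}")
    return scheme

print(storage_backend("hdfs://namenode:8020/data/logs"))  # hdfs
print(storage_backend("s3n://bucket/data/logs"))          # s3n
print(storage_backend("/local/data/logs"))                # file
```

The application code that consumes the data stays identical; only the path changes.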

1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …

75

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76

2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
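In Spark, the deployment choices above surface as a single "master" URL setting: the same application code runs locally, on a standalone cluster, on Mesos, or on YARN depending only on that value. A small illustrative classifier (the URL forms are standard Spark 1.x master URLs; the function itself is a sketch, not Spark code):

```python
# Sketch: map a Spark master URL to the deployment mode it selects.
def deployment_mode(master: str) -> str:
    if master.startswith("local"):          # e.g. "local", "local[4]", "local[*]"
        return "local"
    if master.startswith("spark://"):       # standalone cluster manager
        return "standalone"
    if master.startswith("mesos://"):       # Apache Mesos
        return "mesos"
    if master in ("yarn-client", "yarn-cluster"):  # Hadoop YARN
        return "yarn"
    raise ValueError(f"unknown master URL: {master}")

print(deployment_mode("local[*]"))           # local
print(deployment_mode("spark://host:7077"))  # standalone
print(deployment_mode("mesos://host:5050"))  # mesos
```

Because the rest of the job is unchanged, switching cluster managers is a configuration decision rather than a code rewrite.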

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78

3. Distributions

• Using Spark on a non-Hadoop distribution:

79

Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81


2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 68: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4. Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.

4. Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
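The Data << RAM point is essentially about paying the parse cost once. A stdlib-only Python sketch (the parse function, counter, and sample lines are made up for illustration) shows how keeping the parsed form in memory cuts repeated work across passes, which is roughly what caching an RDD buys when the data fits in cluster memory:

```python
parse_calls = 0

def parse(line):
    """Stand-in for an expensive parse step; counts how often it runs."""
    global parse_calls
    parse_calls += 1
    return len(line)

raw = ["alpha", "beta", "gamma"]

# Without caching: every pass over the data re-parses it (3 lines x 2 passes).
total = sum(parse(l) for l in raw) + max(parse(l) for l in raw)
uncached_calls = parse_calls  # 6

# With caching: parse once, keep the parsed form, reuse it on later passes.
parse_calls = 0
cached = [parse(l) for l in raw]
total2 = sum(cached) + max(cached)
assert total == total2
print(uncached_calls, parse_calls)  # 6 3
```

The same arithmetic is done both times; only the number of parse invocations changes.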

4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html

4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop

5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
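The storage options above differ mainly in the URI the job is pointed at. A toy scheme-to-backend resolver (the handler table and names are illustrative only, not Spark's actual connector registry) sketches the "file-system agnostic" idea: the job code stays the same and only the input URI changes:

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to storage backend.
HANDLERS = {
    "hdfs": "Hadoop Distributed File System",
    "s3a": "Amazon S3",
    "cfs": "Cassandra File System (DataStax)",
    "tachyon": "Tachyon in-memory store",
    "file": "local file system",
}

def resolve_backend(path):
    """Pick a storage backend from the path's scheme; a bare path
    falls back to the local file system."""
    scheme = urlparse(path).scheme or "file"
    return HANDLERS.get(scheme, "unsupported")

print(resolve_backend("s3a://bucket/logs/2015/"))  # Amazon S3
print(resolve_backend("/data/local.txt"))          # local file system
```

Swapping S3 for Tachyon or CassandraFS is then a one-character-class change in the URI, not a rewrite of the job.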

1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
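In practice the deployment choice surfaces in Spark as little more than the master URL handed to the job. A simplified sketch of that dispatch (not Spark's real parser; YARN's client/cluster variants are collapsed into one case) might look like:

```python
def cluster_manager(master):
    """Map a Spark-style master URL to the cluster manager it selects."""
    if master.startswith("local"):        # e.g. local, local[4], local[*]
        return "single JVM, no cluster"
    if master.startswith("spark://"):     # standalone cluster manager
        return "standalone Spark cluster"
    if master.startswith("mesos://"):     # Apache Mesos (optionally via ZooKeeper)
        return "Apache Mesos"
    if master.startswith("yarn"):         # yarn-client / yarn-cluster variants
        return "Hadoop YARN"
    raise ValueError("unknown master URL: " + master)

print(cluster_manager("local[4]"))                      # single JVM, no cluster
print(cluster_manager("mesos://zk://host:2181/mesos"))  # Apache Mesos
```

The application code is unchanged across all of these; only the master URL (and any packaging the target environment needs) differs.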

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

3. Distributions
• Using Spark on a non-Hadoop distribution

Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw

DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

            Hadoop Ecosystem    Spark Ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
   • Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
   • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs. Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature

Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
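Spark itself isn't assumed to be installed here, but the chained flatMap / map / reduceByKey style that the native API encourages can be sketched with plain Python lambdas over a made-up word-count input:

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop", "spark"]

# RDD-style pipeline written with plain lambdas:
words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
pairs = list(map(lambda w: (w, 1), words))            # map: word -> (word, 1)
counts = reduce(                                      # reduceByKey: sum per word
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts["spark"])  # 3
```

In Spark the same shape would be distributed across the cluster; the point of the slide is that Scala, Java 8, and Python all let you express it this concisely.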

Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
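Spark SQL needs a Spark runtime, so as a stand-in, Python's stdlib sqlite3 can illustrate the mix-and-match point from the last bullet: a declarative SQL aggregation followed by imperative post-processing in ordinary code (the table name and rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 100), ("b", 300), ("a", 250)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result set in plain code.
heavy_users = [user for user, total in rows if total > 200]
print(heavy_users)  # ['a', 'b']
```

With Spark SQL the same pattern applies at cluster scale: the SQL step returns a distributed dataset that the programmatic API keeps transforming.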

Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                       Storm                       Spark Streaming
Processing model               Record at a time            Mini-batches
Latency                        Sub-second                  Few seconds
Fault tolerance (every         At least once (may be       Exactly once
record processed)              duplicates)
Batch framework integration    Not available               Core Spark API
Supported languages            Any programming language    Scala, Java, Python
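The processing-model row of the table can be made concrete with a small stdlib-only simulation: record-at-a-time handles each event individually as it arrives, while mini-batching groups the stream into fixed windows before processing (batch size and data are illustrative only):

```python
from itertools import islice

stream = iter(range(10))  # stand-in for an unbounded event stream

def mini_batches(events, batch_size):
    """Group a stream into fixed-size mini-batches, Spark Streaming style."""
    while True:
        batch = list(islice(events, batch_size))
        if not batch:
            return
        yield batch

# Record-at-a-time (Storm style): one handler invocation per event,
# which is what keeps latency sub-second.
processed_per_record = [e * 2 for e in range(10)]

# Mini-batch (Spark Streaming style): one invocation per batch; latency is
# bounded below by the batch interval, but each batch reuses the core engine.
batches = list(mini_batches(stream, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The batching is also what lets Spark Streaming reuse the core batch API and give exactly-once semantics per batch.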

GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.

V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 69: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4 Complementarity +

bull Data gtgt RAM Processing huge data volumes

much bigger than cluster RAM Tez might be better

since it is more ldquostream orientedrdquo has more mature

shuffling implementation closer YARN integration

bull Data ltlt RAM Since Spark can cache in memory

parsed data it can be much better when we process

data smaller than clusterrsquos memory

bull Improving Spark for Data Pipelines with Native

YARN Integration httphortonworkscomblogimproving-spark-data-

pipelines-native-yarn-integration

bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU

69

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 70: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4 Complementarity

bull Emergence of the lsquoSmart Execution Enginersquo Layer

Smart Execution Engine dynamically selects the optimal

compute framework at each step in the big data

analytics process based on the type of platform the

attributes of the data and the condition of the cluster

bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on

November 13 2014 with Matt Schumpert Director of Product

Management at Datameer

bull The Challenge to Choosing the ldquoRightrdquo Execution

Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-

right-execution-enginehtml

70

4 Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf

• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html

• Framework for the Future of Hadoop, March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop

71

5 Key Takeaways

1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.

2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.

3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.

4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:

1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).

2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13

4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots

5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift

74
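Spark reaches these storage systems through Hadoop-style URIs, so the application code stays the same and only the URI scheme changes (e.g. s3n://, tachyon://). The dispatch idea can be sketched in plain Python with no Spark dependency; the reader functions and their return strings here are purely illustrative, not real Spark internals:

```python
from urllib.parse import urlparse

# Hypothetical per-backend readers; real Spark delegates to Hadoop
# FileSystem implementations selected by the URI scheme.
def read_local(path):   return f"local:{path}"
def read_s3(path):      return f"s3:{path}"
def read_tachyon(path): return f"tachyon:{path}"

READERS = {"file": read_local, "s3n": read_s3, "tachyon": read_tachyon}

def text_file(uri):
    """Dispatch on the URI scheme, mirroring how a storage-agnostic
    API keeps application code unchanged across backends."""
    parsed = urlparse(uri)
    reader = READERS.get(parsed.scheme)
    if reader is None:
        raise ValueError(f"unsupported scheme: {parsed.scheme}")
    return reader(parsed.netloc + parsed.path)

print(text_file("s3n://bucket/logs.txt"))
print(text_file("tachyon://host:19998/data"))
```

Swapping HDFS for S3 or Tachyon then amounts to changing the URI string, which is exactly why the deck calls Spark "file system agnostic".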

1 File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include:

• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support). http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance

• Quantcast QFS: https://www.quantcast.com/engineering/qfs

• …

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:

1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local

2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone

3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos

4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2

5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr

6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace

7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud

8. HPC clusters:

• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge

• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster

77
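In practice the choice of deployment mode is expressed through the master URL handed to Spark (e.g. via spark-submit --master); the application itself is unchanged. A small sketch of the real master-URL formats, Spark 1.x era, for several of the modes above; the host names and the wrapping dict are placeholders for illustration:

```python
# Master-URL formats for several Spark deployment modes;
# host names and ports are placeholders.
MASTER_URLS = {
    "local":      "local[*]",                  # single machine, all cores
    "standalone": "spark://master-host:7077",  # Spark standalone cluster manager
    "mesos":      "mesos://mesos-host:5050",   # Apache Mesos
    "yarn":       "yarn-cluster",              # Hadoop YARN (Spark 1.x syntax)
}

def master_url(mode):
    """Return the master URL a driver would pass to its SparkContext."""
    return MASTER_URLS[mode]

for mode in MASTER_URLS:
    print(mode, "->", master_url(mode))
```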

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

78

3 Distributions

• Using Spark on a Non-Hadoop distribution:

79

Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce. https://databricks.com/product/databricks-cloud

• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015. https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html

• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014. https://www.youtube.com/watch?v=dJQ5lV5Tldw

80

DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html

• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4

• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014. http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine

• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

82

83

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

84

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

85

• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

88

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":

• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89

YARN vs Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
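The conciseness the slide refers to comes from chaining short lambda transformations. That style can be shown in plain Python without Spark, using a list as a stand-in for an RDD; the input lines are made up for the example:

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]  # stand-in for an RDD of lines

# Chained lambda transformations in the style of Spark's native API:
# flatMap -> map -> reduceByKey, written with plain-Python equivalents.
words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
pairs = list(map(lambda w: (w, 1), words))            # map: word -> (word, 1)
counts = reduce(                                      # reduceByKey: sum counts
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs, {},
)

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Spark the same pipeline is three chained calls on an RDD; Java 8 lambdas bring the Java version close to this shape, which is the point the bullet makes.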

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
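The "mix and match" claim means you can run a declarative SQL step and then keep working on the result imperatively in the host language. A sketch of that pattern using Python's built-in sqlite3 as a stand-in engine (not Spark SQL itself); the table and rows are invented for the example:

```python
import sqlite3

# In-memory SQL engine standing in for Spark SQL; the pattern shown
# (declarative query, then imperative post-processing) is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 4)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary program logic on the query result.
top = [user for user, total in rows if total > 4]
print(rows)  # [('ann', 7), ('bob', 5)]
print(top)   # ['ann', 'bob']
```

In Spark SQL the declarative half runs over distributed data and the imperative half is ordinary RDD/DataFrame code, but the division of labor is the same.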

Spark MLlib

93

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                      Storm                              Spark Streaming
Processing model              Record at a time                   Mini batches
Latency                       Sub-second                         Few seconds
Fault tolerance (per record)  At least once (may be duplicates)  Exactly once
Batch framework integration   Not available                      Core Spark API
Supported languages           Any programming language           Scala, Java, Python

95
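The first row of the table is the crux: Storm processes one record at a time, while Spark Streaming discretizes the stream into mini batches and runs ordinary batch code on each one. A plain-Python sketch of that discretization, batching by record count rather than by time interval for simplicity:

```python
def mini_batches(stream, batch_size):
    """Discretize a stream into fixed-size mini batches, analogous to how
    Spark Streaming discretizes by time interval (a DStream is a sequence
    of RDDs, one per batch)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

# Each emitted batch would be processed with ordinary batch (RDD) code,
# which is why Spark Streaming integrates with the core Spark API.
events = ["a", "b", "c", "d", "e"]
print(list(mini_batches(events, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

This batching is also what produces the "few seconds" latency in the table: no record is processed before its batch closes.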

GraphX

96

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

5 Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 71: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4 Complementarity

bull Operating in a Multi-execution Engine Hadoop Environment by

Erik Halseth of Datameer on January 27th 2015 at the Los Angeles

Big Data Users Group

bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-

2015pdf

bull New Syncsort Big Data Software Removes Major Barriers to

Mainstream Apache Hadoop Adoption February 12 2015

httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-

Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366

bull Syncsort Automates Data Migrations Across Multiple Platforms

February 23 2015

httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-

migrations-across-multiple-platformshtml

bull Framework for the Future of Hadoop March 9 2015

httpblogsyncsortcom201503framework-future-hadoop

71

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 72: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

5 Key Takeaways1 Evolution of compute models is still ongoing

Watch out Apache Flink project for true low-latency and iterative use cases

2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity

3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way

4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all

72

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

73

1 File System

Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example

1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)

2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml

3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13

4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3

bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html

bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots

5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml

bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift

74

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

88
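The "no code change" point holds because Tachyon exposes the same Hadoop-compatible file-system interface that existing programs already target. This toy Python sketch (illustrative names only, not the Tachyon API) shows the idea: two independent "frameworks" share one memory-backed store through a common read/write interface.

```python
class MemoryFS:
    """Toy memory-backed store exposing a minimal file-system-like interface."""
    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


def spark_like_job(fs):
    # A "Spark" job writes its output through the shared interface.
    fs.write("/out/part-0", "hello")


def mapreduce_like_job(fs):
    # A "MapReduce" job reads the same data, with no code change needed,
    # because it talks to the same interface.
    return fs.read("/out/part-0").upper()


fs = MemoryFS()
spark_like_job(fs)
result = mapreduce_like_job(fs)
```

Because both jobs only depend on the interface, swapping a disk-backed store for a memory-backed one (Tachyon's pitch) is transparent to them.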

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

89
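Fine-grained sharing means a framework acquires resources per task and returns them as soon as each task finishes, so idle capacity is visible to other frameworks in between tasks. A simplified plain-Python sketch (not the Mesos API) contrasts this with coarse-grained reservation:

```python
class Cluster:
    """Minimal model of a pool of CPUs shared between frameworks."""
    def __init__(self, cpus):
        self.free = cpus

    def acquire(self, n):
        take = min(n, self.free)
        self.free -= take
        return take

    def release(self, n):
        self.free += n


cluster = Cluster(cpus=8)

# Coarse-grained: reserve the whole allocation up front for the job's lifetime;
# nothing is free for other frameworks until the job ends.
coarse = cluster.acquire(8)
cluster.release(coarse)

# Fine-grained: each task takes one CPU and returns it immediately,
# leaving the remaining capacity free for other frameworks in between.
peak_free = []
for task in range(4):
    got = cluster.acquire(1)
    peak_free.append(cluster.free)  # 7 CPUs stay available during each task
    cluster.release(got)
```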

YARN vs. Mesos

Criteria           YARN                        Mesos
Resource sharing   Yes                         Yes
Written in         Java                        C++
Scheduling         Memory only                 CPU and memory
Running tasks      Unix processes              Linux container groups
Requests           Specific requests and       More generic, but more coding
                   locality preference         for writing frameworks
Maturity           Less mature                 Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python

• Interactive shells in Scala and Python

• Spark supports Java 8 lambda expressions, making code nearly as concise as the Scala API.

• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
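The concise functional style the slide refers to can be shown without a cluster. This plain-Python sketch (standard library only, not the Spark API itself) mirrors the classic Spark word-count chain of flatMap / map / reduceByKey:

```python
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap equivalent: split each line into a flat stream of words
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey equivalent: count occurrences per word
counts = Counter(words)
```

In Spark proper the same pipeline would be a short chain of RDD or DataFrame operations; the point of the slide is that the native API keeps such chains this compact in Scala, Java 8, and Python alike.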

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
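To illustrate the "mix SQL with imperative APIs" point without a Spark installation, this sketch uses Python's built-in sqlite3 as a stand-in SQL engine: the aggregation is declarative SQL, and the follow-up ranking is ordinary imperative code. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 7), ("a", 2)])

# Declarative part: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: post-process the result set in ordinary code
top = max(rows, key=lambda r: r[1])
```

In Spark SQL the same split looks similar: a `SELECT ... GROUP BY` produces a result that the program then continues to transform with the regular RDD/DataFrame API.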

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

Criteria                      Storm                    Spark Streaming
Processing model              Record at a time         Mini batches
Latency                       Sub-second               Few seconds
Fault tolerance (every        At least once            Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available            Core Spark API
Supported languages           Any programming          Scala, Java, Python
                              language

95
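The processing-model row is the key difference in the table above. A minimal plain-Python sketch (not the Storm or Spark APIs) contrasts record-at-a-time handling with grouping a stream into mini batches, which is Spark Streaming's model:

```python
def record_at_a_time(stream, handle):
    # Storm-style: apply the handler to each record as it arrives
    return [handle(r) for r in stream]


def micro_batches(stream, batch_size):
    # Spark Streaming-style: collect records into small batches,
    # then hand each batch to the batch engine
    batch = []
    for r in stream:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


events = [1, 2, 3, 4, 5]
doubled = record_at_a_time(events, lambda r: r * 2)
batches = list(micro_batches(events, 2))
```

Batching is what gives Spark Streaming its exactly-once semantics and batch-API reuse, at the cost of the few-seconds latency shown in the table.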

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System

2. Deployment

3. Distributions

4. Alternatives

5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.

2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com

sbaltagi@gmail.com

https://www.linkedin.com/in/slimbaltagi

@SlimBaltagi

http://www.slideshare.net/sbaltagi

Page 75: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

1 File System

When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS

HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage

bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance

bull Quantcast QFS httpswwwquantcastcomengineeringqfs

bull hellip

75

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

76

2 Deployment

While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible

1 Local httpsparkbigdatacomtutorials51-deployment121-local

2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone

3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos

4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2

5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr

6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace

7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud

8 HPC Clusters

bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge

bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster

77

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

6 Key Takeaways

78

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce: http://tachyon-project.org

• Tachyon is Hadoop-compatible: existing Spark
and MapReduce programs can run on top of it
without any code change.

• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/

88

• Mesos (http://mesos.apache.org) enables fine-grained
sharing, which allows a Spark job to dynamically
take advantage of idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs.

• Mesos as data center "OS": share a datacenter between
multiple cluster computing apps; provide new
abstractions and services.

• Mesosphere DCOS: datacenter services including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS…

• 'Mesos' tag at SparkBigData.com:
http://sparkbigdata.com/component/tags/tag/16-mesos

89
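As a sketch of how this deployment choice surfaces in practice, the same application can be submitted to either cluster manager just by changing the `--master` URL. The host name, port, class, and jar below are placeholders, not real endpoints:

```shell
# Submit the same Spark application to different cluster managers
# (Spark 1.x-era syntax; names below are placeholders).

# On a Mesos cluster (fine-grained resource sharing):
spark-submit --master mesos://mesos-master:5050 \
  --class com.example.MyApp my-app.jar

# On a Hadoop/YARN cluster:
spark-submit --master yarn-cluster \
  --class com.example.MyApp my-app.jar
```

The application code itself is unchanged; only the submission target differs.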

YARN vs Mesos

Criteria           YARN                  Mesos
Resource sharing   Yes                   Yes
Written in         Java                  C++
Scheduling         Memory only           CPU and memory
Running tasks      Unix processes        Linux container groups
Requests           Specific requests     More generic, but more coding
                   and locality          for writing frameworks
                   preference
Maturity           Less mature           Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java
code much more concise – nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com:
http://sparkbigdata.com/component/tags/tag/11-core-spark

91
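The conciseness argument is easy to see without a cluster: Spark's core API is a functional map/filter/reduce style over distributed collections. The toy below mimics that style on plain Python lists (no Spark involved) purely to show the shape of the code:

```python
# A toy word count in the functional style Spark's native API exposes.
# Plain Python lists stand in for RDDs; this illustrates the programming
# model only, not Spark itself.
from collections import Counter

lines = ["spark or hadoop", "spark with hadoop", "spark without hadoop"]

# The classic RDD pipeline, flatMap -> map -> reduceByKey, approximated
# with Python builtins:
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = Counter()
for word, n in pairs:                                 # reduceByKey
    counts[word] += n

print(counts["spark"])   # 3
print(counts["hadoop"])  # 3
```

In actual Spark, `lines` would be an RDD and each step a single chained method call, which is why lambda support matters so much for the Java API.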

Spark SQL

• Spark SQL is a new SQL engine designed from the
ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains
compatibility with Hive: it supports all existing Hive data
formats, user-defined functions (UDFs) and the Hive
metastore.
• Spark SQL also allows manipulating (semi-)structured
data as well as ingesting data from sources that
provide a schema, such as JSON, Parquet, Hive or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL with more
imperative programming APIs for advanced analytics.

92
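The "mix and match SQL with imperative code" workflow can be sketched with Python's built-in sqlite3 standing in as a toy SQL engine. This is only an analogy for the style of work Spark SQL enables; Spark SQL's own API and the table/column names below are not from the talk:

```python
# Declarative SQL for the aggregation, imperative Python for the rest.
# sqlite3 is a stand-in engine here; Spark SQL's API differs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary host-language logic over the SQL result.
heavy_users = [user for user, total in rows if total >= 5]
print(heavy_users)  # ['ann', 'bob']
```

Spark SQL applies the same pattern at cluster scale, with the SQL result flowing into RDD/DataFrame operations instead of a Python list.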

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com:
http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com:
http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                        Storm                      Spark Streaming
Processing model                Record at a time           Mini batches
Latency                         Sub-second                 Few seconds
Fault tolerance (every record   At least once (may be      Exactly once
processed)                      duplicates)
Batch framework integration     Not available              Core Spark API
Supported languages             Any programming language   Scala, Java, Python

95
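The "record at a time" vs "mini batches" distinction in the table above can be sketched in a few lines of plain Python: a micro-batching system groups the incoming stream into small batches and processes each batch as a unit. This illustrates the two models only; neither Storm's nor Spark Streaming's actual APIs look like this:

```python
# Contrast of the two stream-processing models from the table above.
stream = list(range(10))  # a stand-in for an incoming record stream

# Record-at-a-time (Storm-style): each record is handled as it arrives.
processed_one_by_one = [record * 2 for record in stream]

# Mini-batches (Spark Streaming-style): records are grouped into small
# batches, and each batch is processed as a unit.
batch_size = 4
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
processed_in_batches = [sum(batch) for batch in batches]

print(batches)               # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(processed_in_batches)  # [6, 22, 17]
```

Batching is what gives Spark Streaming its few-seconds latency floor, but also what lets it reuse the core Spark batch API and provide exactly-once semantics per batch.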

GraphX

96

'GraphX' tag at SparkBigData.com:
http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup or even JavaScript in a
collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython: https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi




Page 79: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

3 Distributions

bull Using Spark on a Non-Hadoop distribution

79

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com

• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. http://stratio.github.io/streaming-cep-engine

• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40

• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.

• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.

• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.

• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU

• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. Eric Carr, September 25, 2014. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html

• The Guavus operational intelligence platform analyzes streaming data and data at rest.

• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution

• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

4. Alternatives

             Hadoop Ecosystem   Spark Ecosystem
  Component  HDFS               Tachyon
             YARN               Mesos
  Tools      Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark

• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org

• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.

• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software

• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.

• Mesos as a data center "OS": share a datacenter between multiple cluster computing apps, and provide new abstractions and services.

• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...

• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos

YARN vs. Mesos

  Criteria          YARN                        Mesos
  Resource sharing  Yes                         Yes
  Written in        Java                        C++
  Scheduling        Memory only                 CPU and memory
  Running tasks     Unix processes              Linux container groups
  Requests          Specific requests and       More generic, but more coding
                    locality preference         for writing frameworks
  Maturity          Less mature                 Relatively more mature

Spark Native API

• Spark native API in Scala, Java, and Python.

• Interactive shell in Scala and Python.

• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.

• ETL with Spark - First Spark London Meetup, May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup

• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
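The shape of the native RDD API can be sketched without a cluster. PySpark is not assumed to be installed here, so the `LocalRDD` class below is a hypothetical, single-machine stand-in (not part of Spark); only its method names (`flatMap`, `map`, `reduceByKey`, `collect`) and the word-count pipeline mirror the real Spark API.

```python
class LocalRDD:
    """Tiny local stand-in mimicking the shape of Spark's RDD API.
    Illustration only: real Spark runs these operations distributed."""

    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # One input element can produce many output elements.
        return LocalRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge values for each key with the given function.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return LocalRDD(acc.items())

    def collect(self):
        return self.data

# The classic word count, written as it would be in PySpark:
lines = LocalRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(sorted(counts))  # [('and', 1), ('hadoop', 2), ('or', 1), ('spark', 2)]
```

In real PySpark the pipeline would start from `sc.textFile(...)` and execute across a cluster; only the data source and execution engine change, not the shape of the code.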

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql

• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.

• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
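The "mix and match SQL and imperative APIs" idea can be illustrated without a Spark cluster. The sketch below uses Python's built-in sqlite3 as a stand-in SQL engine; the `events` table and the clicks-greater-than-5 threshold are invented for illustration. In Spark SQL the same pattern would register a DataFrame as a temp view and query it with `spark.sql(...)`.

```python
import sqlite3

# Hypothetical sample data; in Spark SQL this would be a DataFrame
# registered as a temporary view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ana", 3), ("bo", 7), ("ana", 5)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code.
heavy_users = [user for user, total in rows if total > 5]
print(sorted(heavy_users))  # ['ana', 'bo']
```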

Spark MLlib

• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs. Spark Streaming

  Criteria                     Storm                       Spark Streaming
  Processing model             Record at a time            Mini batches
  Latency                      Sub-second                  Few seconds
  Fault tolerance (every       At least once (may be       Exactly once
  record processed)            duplicates)
  Batch framework integration  Not available               Core Spark API
  Supported languages          Any programming language    Scala, Java, Python
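The processing-model row of the comparison can be sketched in plain Python: a hypothetical event stream is handled one record at a time (Storm-style) and, alternatively, grouped into one-second mini batches (Spark Streaming-style). The event data and the one-second batch width are made up for illustration; the batch width plays the role of Spark Streaming's batch interval.

```python
# A hypothetical stream of (timestamp_seconds, value) events.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

# Record-at-a-time (Storm-style): each event is processed as it arrives.
processed_one_by_one = [value.upper() for _, value in events]

# Micro-batching (Spark Streaming-style): events are grouped into
# fixed-width time windows and each batch is processed as a small
# collection, which is what lets the core batch API be reused.
def to_batches(events, width=1.0):
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // width), []).append(value)
    return [batches[k] for k in sorted(batches)]

batches = to_batches(events)
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Latency follows directly from the model: the first event is visible immediately in the record-at-a-time path, but only after its one-second batch closes in the micro-batch path.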

GraphX

• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.

• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook

• ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

5. Key Takeaways

1. File System: Spark is file system agnostic. Bring your own storage.

2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.

3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.

4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

V. More Q&A

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

Page 80: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

Cloud

bull Databricks Cloud is not dependent on

Hadoop It gets its data from Amazonrsquos S3

(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud

bull Databricks Cloud From raw data to insights and

data products in an instant March 4 2015

httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-

insights-and-data-products-in-an-instanthtml

bull Databricks Cloud Announcement and Demo at

Spark Summit 2014 July 2 2014

httpswwwyoutubecomwatchv=dJQ5lV5Tldw

80

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 81: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

DSE

bull DSE DataStax Enterprise built on Apache Cassandra

presents itself as a Non-Hadoop Big Data Platform

Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter

prisesparksparkTOChtml

bull Escape from Hadoop Ultra Fast Data Analysis with

Spark amp Cassandra Piotr Kolaczkowski September 26 2014

httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4

bull Escape from Hadoop with Apache Spark and

Cassandra with the Spark Cassandra Connector

Helena Edelson published on November 24 2014

httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-

spark-and-cassandra-41950082

81

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 82: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom

bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine

bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40

82

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 83: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

83

bull xPatterns (httpatigeocomtechnology) is a complete big

data analytics platform available with a novel

architecture that integrates components across

three logical layers Infrastructure Analytics

and Applications

bull xPatterns is cloud-based exceedingly scalable

and readily interfaces with existing IT systems

bull lsquoxPatternsrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

39

84

bull The BlueData (httpwwwbluedatacom) EPIC software

platform solves the infrastructure challenges and

limitations that can slow down and stall Big Data

deployments

bull With EPIC software you can spin up Hadoop

clusters ndash with the data and analytical tools that

your data scientists need ndash in minutes rather than

months httpswwwyoutubecomwatchv=SE1OP4ImrxU

bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark

91
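The conciseness of the native API is easy to picture with the classic word-count example: a short chain of map/reduce steps. Below is a plain-Python sketch of that same chain, with the equivalent PySpark calls noted in comments; no Spark installation is assumed, and `lines` is made-up sample data.

```python
from collections import Counter
from itertools import chain

# Made-up sample input; in Spark this would come from sc.textFile(...)
lines = ["to be or not to be", "to spark or not to spark"]

# flatMap step: split each line into words
# (Spark: rdd.flatMap(lambda line: line.split()))
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey steps: count occurrences of each word
# (Spark: words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b))
counts = Counter(words)

print(counts["to"])     # 4
print(counts["spark"])  # 2
```

The Spark version is nearly as short; the point of the lambda-friendly API is that the whole pipeline stays a readable one-liner per transformation.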

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
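Running Spark SQL itself requires a Spark cluster, but the mix-and-match pattern the slide describes can be sketched with Python's built-in sqlite3 module standing in for the SQL engine: a declarative query does the aggregation, then arbitrary imperative code post-processes the result. The table, columns, and data below are made up for illustration.

```python
import sqlite3

# In-memory table standing in for a Spark SQL table (made-up schema/data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: SQL aggregation
# (Spark SQL analogue: spark.sql("SELECT user, SUM(clicks) ... GROUP BY user"))
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: plain Python logic over the query result
top = {user: total for user, total in rows if total > 5}
print(top)  # {'ann': 8, 'bob': 7}
```

In Spark SQL the same interleaving happens in one program: SQL produces a DataFrame, and the Scala/Java/Python API takes over from there.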

Spark MLlib

• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

93

Spark Streaming

• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

94

Storm vs. Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python

95
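The processing-model row is the key difference, and it is easy to picture in code: a record-at-a-time engine hands each event to the handler the moment it arrives, while a mini-batch engine first groups events into small slices. A minimal plain-Python sketch of the two models (no Storm or Spark involved; the fixed batch size is an arbitrary stand-in for Spark Streaming's batch interval):

```python
def record_at_a_time(stream, handle):
    # Storm-style: every record is processed individually on arrival
    return [handle(record) for record in stream]

def mini_batches(stream, handle_batch, batch_size=3):
    # Spark Streaming-style: records are grouped into small batches first
    out, batch = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            out.append(handle_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.append(handle_batch(batch))
    return out

events = [1, 2, 3, 4, 5, 6, 7]
print(record_at_a_time(events, lambda r: r * 10))  # [10, 20, 30, 40, 50, 60, 70]
print(mini_batches(events, sum))                   # [6, 15, 7]
```

Batching trades the sub-second latency of per-record handling for the throughput and exactly-once semantics the table attributes to Spark Streaming.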

GraphX

• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

96

Notebook

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark

97

IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi

BlueData

• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37

84

Guavus

• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38

85

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 85: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

85

bull Guavus (httpwwwguavuscom) embeds Apache Spark into

its Operational Intelligence Platform Deployed at the

Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-

operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml

bull Guavus operational intelligence platform analyzes

streaming data and data at rest

bull The Guavus Reflex 20 platform is commercially

compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-

platform-now-certified-spark-distribution

bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 86: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

IV Spark without Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

86

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 87: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

4 AlternativesHadoop Ecosystem Spark Ecosystem

Component

HDFS Tachyon

YARN Mesos

Tools

Pig Spark native API

Hive Spark SQL

Mahout MLlib

Storm Spark Streaming

Giraph GraphX

HUE Spark NotebookISpark

87

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 88: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

bull Tachyon is a memory-centric distributed file

system enabling reliable file sharing at memory-

speed across cluster frameworks such as Spark

and MapReduce httpshttptachyon-projectorg

bull Tachyon is Hadoop compatible Existing Spark

and MapReduce programs can run on top of it

without any code change

bull Tachyon is the storage layer of the Berkeley

Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware

88

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 89: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

bull Mesos (httpmesosapacheorg) enables fine

grained sharing which allows a Spark job to dynamically

take advantage of the idle resources in the cluster during

its execution This leads to considerable performance

improvements especially for long running Spark jobs

bull Mesos as Data Center ldquoOSrdquo

bull Share datacenter between multiple cluster computing

apps Provide new abstractions and services

bull Mesosphere DCOS Datacenter services including

Apache Spark Apache Cassandra Apache YARN

Apache HDFShellip

bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos

89

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

bull Spark SQL is a new SQL engine designed from

ground-up for Spark httpssparkapacheorgsql

bull Spark SQL provides SQL performance and maintains

compatibility with Hive It supports all existing Hive data

formats user-defined functions (UDF) and the Hive

metastore

bull Spark SQL also allows manipulating (semi-) structured

data as well as ingesting data from sources that

provide schema such as JSON Parquet Hive or

EDWs It unifies SQL and sophisticated analysis

allowing users to mix and match SQL and more

imperative programming APIs for advanced analytics

92

Spark MLlib

93

lsquoSpark MLlib rsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib

Spark Streaming

94

lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-

spark-streaming

Storm vs Spark StreamingCriteria

Processing Model Record at a time Mini batches

Latency Sub second Few seconds

Fault tolerancendash

every record

processed

At least one ( may

be duplicates)

Exactly one

Batch Framework

integration

Not available Core Spark API

Supported

languages

Any programming

language

Scala Java

Python

95

GraphX

96

lsquoGraphXrsquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponent

tagstag6-graphx

Notebook

97

bull Zeppelin httpzeppelin-projectorg is a web-based

notebook that enables interactive data analytics

Has built-in Apache Spark support

bull Spark Notebook is an interactive web-based

editor that can combine Scala code SQL

queries Markup or even JavaScript in a

collaborative manner httpsgithubcomandypetrellaspark-

notebook

bull ISpark is an Apache Spark-shell backend for

IPython httpsgithubcomtribbloidISpark

IV Spark on Non-Hadoop

1 File System

2 Deployment

3 Distributions

4 Alternatives

5 Key Takeaways

98

6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage

2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment

3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging

4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another

99

IV More QampA

100

httpwwwSparkBigDatacom

sbaltagigmailcom

httpswwwlinkedincominslimbal

tagi

SlimBaltagi

httpwwwslidesharenetsbaltagi

Page 90: Spark or Hadoop: is it an either-or proposition? By Slim Baltagi

YARN vs MesosCriteria

Resource

sharing

Yes Yes

Written in Java C++

Scheduling Memory only CPU and Memory

Running tasks Unix processes Linux Container groups

Requests Specific requests

and locality

preference

More generic but more

coding for writing

frameworks

Maturity Less mature Relatively more mature

90

Spark Native API

bull Spark Native API in Scala Java and Python

bull Interactive shell in Scala and Python

bull Spark supports Java 8 for a much more concise

Lambda expressions to get code nearly as

simple as the Scala API

bull ETL with Spark - First Spark London Meetup May 28 2014

httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-

meetup

bull lsquoSpark Corersquo Tag at

SparkBigDatacomhttpsparkbigdatacomcomponenttagstag

11-core-spark

91

Spark SQL

• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data and ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

92
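The "mix and match SQL with imperative code" point can be sketched without a Spark cluster using Python's built-in sqlite3; the `events` table and its rows are invented for illustration, and this is an analogy, not Spark SQL's API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 3), ("b", 5), ("a", 2)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... imperative step: post-process the result set in ordinary code
totals = {user: total for user, total in rows}
# totals == {"a": 5, "b": 5}
```

Spark SQL applies the same pattern at cluster scale, letting a SQL query's result flow directly into programmatic transformations and back.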

Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming

Storm vs Spark Streaming

Criteria                                  Storm                               Spark Streaming
Processing model                          Record at a time                    Mini batches
Latency                                   Sub-second                          Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)   Exactly once
Batch framework integration               Not available                       Core Spark API
Supported languages                       Any programming language            Scala, Java, Python

95
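The table's "record at a time" versus "mini batches" distinction can be sketched in plain Python; `mini_batches`, the batch size and the sample stream are invented for illustration and do not use either engine's API.

```python
def mini_batches(stream, batch_size):
    """Group a record stream into fixed-size mini batches (Spark Streaming
    style); a record-at-a-time engine like Storm would instead invoke a
    handler once per record, gaining latency at some throughput cost."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

batches = list(mini_batches(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Batching amortizes per-record overhead and makes exactly-once recovery simpler, at the price of a few seconds of latency per batch.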

GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx

Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark

IV. Spark on Non-Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98

5. Key Takeaways

1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99

V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
