Spark or Hadoop: Is it an either-or proposition?
By Slim Baltagi (@SlimBaltagi) sbaltagi@gmail.com
http://www.SparkBigData.com
OR? XOR?
Los Angeles Spark Users Group, March 12, 2015
Your Presenter – Slim Baltagi
• Sr. Big Data Solutions Architect living in Chicago
• Over 17 years of IT and business experience
• Over 4 years of Big Data experience, working on over 12 Hadoop projects
• Speaker at Big Data events
• Creator and maintainer of the Apache Spark Knowledge Base http://www.SparkBigData.com, with over 4,000 categorized Apache Spark web resources
@SlimBaltagi
https://www.linkedin.com/in/slimbaltagi
sbaltagi@gmail.com
Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing nor promoting any product or vendor mentioned in this talk.
Agenda
I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More
Q&A
I. Motivation
1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways
1. News
• Is it Spark vs. OR and Hadoop?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage To Hadoop Will Be Bigger Than Kim And Kanye
• Adios Hadoop, Hola Spark
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop
• Escape From Hadoop
• Spark promises to up-end Hadoop, but in a good way
2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015. http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe. http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
Apache Spark Survey 2015 by Typesafe – Quick Snapshot
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica. https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia. http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia. http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013. http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop. https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014. https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014. http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
4. Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015. http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• "Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
2. Typical Big Data Stack
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
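The "assembly code" comparison is easier to feel with code. Below is a plain-Python sketch (illustrative only, not Hadoop or Spark code) of the same word count written twice: once with the explicit map/shuffle/reduce phases that MapReduce forces you to spell out, and once as the single expression that higher-level APIs make possible.

```python
from collections import Counter, defaultdict

docs = ["spark and hadoop", "spark or hadoop", "spark"]

# MapReduce style: explicit map, shuffle and reduce phases.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

def reducer(word, counts):
    return word, sum(counts)

mapped = [kv for doc in docs for kv in mapper(doc)]   # map phase
shuffled = defaultdict(list)                          # shuffle phase
for word, n in mapped:
    shuffled[word].append(n)
counts = dict(reducer(w, ns) for w, ns in shuffled.items())  # reduce phase

# What a higher-level API (Pig, Hive, Spark, ...) lets you say in one line:
concise = Counter(word for doc in docs for word in doc.split())

print(counts)   # {'spark': 3, 'and': 1, 'hadoop': 2, 'or': 1}
```

Both versions compute the same result; the point is how much plumbing the low-level style demands for a trivial job.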
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• "Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop."
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
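To make the RDD idea concrete, here is a toy pure-Python imitation (the class name and methods are invented for illustration; this is not Spark's actual API) of the lazy-transformation / eager-action split that RDD pipelines are built on: transformations only record work, and nothing runs until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # deferred transformation steps

    def map(self, f):                 # lazy: just records the step
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # lazy as well
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: runs the whole pipeline
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

pipeline = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(pipeline.collect())   # [0, 4, 16, 36, 64]
```

In real Spark the same shape holds, with the addition that intermediate results can be cached in cluster memory, which is where the speed-up for iterative workloads comes from.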
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
Hadoop MapReduce vs. Tez vs. Spark
• License: Open Source, Apache 2.0, for all three; MapReduce at version 2.x, Tez at version 0.x, Spark at version 1.x.
• Processing Model: MapReduce: on-disk (disk-based parallelization), batch. Tez: on-disk; batch, interactive. Spark: in-memory and on-disk; batch, interactive, streaming (near real-time).
• Language written in: MapReduce: Java. Tez: Java. Spark: Scala.
• API: MapReduce: [Java, Python, Scala], user-facing. Tez: Java, [ISV/Engine/Tool builder]. Spark: [Scala, Java, Python], user-facing.
• Libraries: MapReduce: none, separate tools. Tez: none. Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX].
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Installation: MapReduce: bound to Hadoop. Tez: bound to Hadoop. Spark: isn't bound to Hadoop.
• Ease of Use: MapReduce: difficult to program, needs abstractions; no interactive mode except Hive/Pig. Tez: difficult to program; no interactive mode except Hive/Pig. Spark: easy to program, no need of abstractions; interactive mode.
• Compatibility: to data types and data sources is the same for all three.
• YARN integration: MapReduce: YARN application. Tez: ground-up YARN application. Spark: moving towards YARN.
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Deployment: MapReduce: YARN. Tez: YARN. Spark: [Standalone, YARN, SIMR, Mesos, …].
• Performance: Spark: good performance when data fits into memory; performance degradation otherwise.
• Security: MapReduce: more features and projects. Tez: more features and projects. Spark: still in its infancy; partial support.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
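The first migration path can be sketched without a cluster. Below, a Hadoop-style mapper and reducer are written once and then driven by a Spark-like flatMap / groupByKey / reduce pipeline, simulated here with Python's standard library (illustrative only; in real Spark you would call these same functions from an RDD pipeline in Java or Scala).

```python
from itertools import groupby

# Hadoop-style mapper and reducer, written once...
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

# ...and reused unchanged in a Spark-like functional pipeline:
# flatMap over lines, group by key, then reduce each group.
lines = ["to be or not to be"]
pairs = sorted(kv for line in lines for kv in mapper(line))
result = {k: v
          for key, grp in groupby(pairs, key=lambda kv: kv[0])
          for k, v in reducer(key, [n for _, n in grp])}
print(result)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The mapper and reducer bodies are untouched; only the driver changes, which is why this migration path needs so little development effort.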
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig; still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
3. Integration
Service / Open Source Tool:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
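Schema inference itself is easy to illustrate outside Spark. The plain-Python sketch below (not Spark SQL's implementation; the type names are chosen for illustration) scans JSON records and derives a field-to-type mapping, which is conceptually the first step Spark SQL performs before letting you run SQL over the data.

```python
import json

records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 29, "city": "Chicago"}',
]

# Infer a flat schema: the union of all fields seen, with each field's
# JSON type (first occurrence wins in this simplified sketch).
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        kind = {str: "string", int: "integer",
                float: "double", bool: "boolean"}.get(type(value), "string")
        schema.setdefault(field, kind)

print(schema)   # {'name': 'string', 'age': 'integer', 'city': 'string'}
```

Real Spark SQL handles nested structures, arrays, and type widening across records; the point here is only that no DDL is needed because the schema falls out of the data.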
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
4. Complementarity: Mesos + YARN References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
• httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS
• httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
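As an illustration of option 4 above, pointing Spark at Amazon S3 is just a URI scheme plus credentials; a hedged sketch (Spark 1.x with Hadoop's s3n filesystem; the bucket name and keys are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Demo"))

// Credentials can also come from core-site.xml or IAM roles; placeholders here
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// No HDFS anywhere: read straight from an S3 bucket
val logs = sc.textFile("s3n://my-bucket/logs/*.gz")   // placeholder bucket/path
println(logs.count())
```

The same pattern applies to MapR-FS or Swift: only the URI scheme and the relevant Hadoop configuration keys change.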
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
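Across all of the deployment options above, the application code stays the same; only the master URL changes. A sketch against the Spark 1.x API (host names and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick ONE master URL; the rest of the application does not change.
val conf = new SparkConf().setAppName("DeploymentDemo")
conf.setMaster("local[*]")                     // 1. Local mode: all cores of one machine
// conf.setMaster("spark://master-host:7077")  // 2. Standalone cluster manager
// conf.setMaster("mesos://mesos-host:5050")   // 3. Apache Mesos
// conf.setMaster("yarn-client")               // On YARN (usually set via spark-submit --master)

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())
```

In practice the master is most often left out of the code entirely and supplied at launch time with spark-submit, which keeps one build artifact portable across all of these environments.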
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
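Because Tachyon exposes a Hadoop-compatible file system, "no code change" in practice means only the URI scheme changes. A hedged sketch (the Tachyon master host, its default port 19998, and the paths are placeholders, and the cluster must be configured with the Tachyon client on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("TachyonDemo"))

// Same textFile/saveAsTextFile calls as with HDFS, just a tachyon:// URI
val data = sc.textFile("tachyon://tachyon-master:19998/input")      // placeholder path
data.map(_.toUpperCase)
    .saveAsTextFile("tachyon://tachyon-master:19998/output")        // shared at memory speed
```

Data written this way outlives the Spark application, so a MapReduce job (or another Spark job) can read the output without any serialization to disk-backed HDFS.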
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, so Java code can be nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
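The conciseness claim is easy to see on the canonical example: word count is a handful of lines against the native Scala API (a sketch; input and output paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("WordCount").setMaster("local[*]"))

val counts = sc.textFile("hdfs:///input/text")  // placeholder path
  .flatMap(_.split("\\s+"))                     // lines -> words
  .map(word => (word, 1))                       // word -> (word, 1)
  .reduceByKey(_ + _)                           // sum counts per word

counts.saveAsTextFile("hdfs:///output/wordcounts")
```

The same pipeline can be typed line by line in the interactive shell (spark-shell), which is one of the biggest ergonomic differences from writing a MapReduce job in Java.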
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
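The "mix and match" point looks like this in the Spark 1.2-era API: schema is inferred from JSON, queried in SQL, and the result flows back into the regular RDD API (a sketch; the file path and field names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc         = new SparkContext(new SparkConf().setAppName("SparkSQLDemo"))
val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON documents
val people = sqlContext.jsonFile("hdfs:///data/people.json")  // placeholder path
people.registerTempTable("people")

// Declarative SQL ...
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// ... chained into the imperative API on the same result
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```

In later releases jsonFile was superseded by the DataFrame reader API, but the unified SQL-plus-code model described above is unchanged.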
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag3-spark-streaming
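Spark Streaming applies the same RDD operations to mini batches of live data (the model compared against Storm below). A minimal sketch: counting words over 10-second batches from a socket (host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second mini batches

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
val counts = lines.flatMap(_.split(" "))
                  .map((_, 1))
                  .reduceByKey(_ + _)                 // same API as batch Spark
counts.print()

ssc.start()
ssc.awaitTermination()
```

Note the batch interval (here 10 seconds) is the floor on latency, which is exactly the "few seconds" row in the Storm comparison that follows.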
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica httpsdatabrickscomblog20140121spark-and-hadoophtml
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia httpwwwslidesharenetdatabricksspark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 httpvisionclouderacommapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." httpswwwmaprcomproductsapache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." httphortonworkscomhadoopspark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits", October 30th, 2014 httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II Big Data Typical Big Data Stack Hadoop Spark
1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool; the above definition is inadequate
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset httpbigdataandreamostosiname Incomplete but a useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch; scalability; abstractions (see slide on evolution of Programming APIs); User Defined Functions (UDFs); …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing Model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory, on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
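Point 1 above can be made concrete: the map() and reduce() bodies of a MapReduce job carry over almost verbatim, while all the driver boilerplate disappears. A hedged sketch (the mapper/reducer here are hypothetical stand-ins for logic lifted out of an existing job; paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical logic extracted from an existing MapReduce job
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").toSeq.map(w => (w, 1))
def reducer(a: Int, b: Int): Int = a + b

val sc = new SparkContext(new SparkConf().setAppName("MigratedJob"))
sc.textFile("hdfs:///input")         // was: FileInputFormat.addInputPath(...)
  .flatMap(mapper)                   // was: Mapper.map()
  .reduceByKey(reducer)              // was: shuffle + Reducer.reduce()
  .saveAsTextFile("hdfs:///output")  // was: FileOutputFormat.setOutputPath(...)
```

The shuffle that MapReduce performs between map and reduce phases is implied by reduceByKey, so the migrated code keeps the same semantics with far less ceremony.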
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Service categories from the Hadoop ecosystem that integrate with Spark (the original slide shows the open source tools for each category as logos):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without going through the Hadoop API: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
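The InputFormat route can be sketched as follows, modeled loosely on HBaseTest.scala; the table name is a placeholder and the HBase client jars are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: expose an HBase table as an RDD via the Hadoop InputFormat API.
object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table
    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println("rows: " + rows.count())
    sc.stop()
  }
}
```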
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
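With the DataStax connector, the table-as-RDD idea looks roughly like this; the keyspace, table, columns and host are placeholders, assuming a connector version matching Spark 1.2:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: read a Cassandra table as an RDD and write a small RDD back.
object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-example")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    val users = sc.cassandraTable("my_keyspace", "users")   // table exposed as an RDD
    println("users: " + users.count())

    sc.parallelize(Seq(("bob", 42)))
      .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
    sc.stop()
  }
}
```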
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its ability to read and write JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues: https://issues.apache.org/jira/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
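Querying an existing Hive table from Spark SQL can be sketched with the 1.2-era HiveContext; the table and column names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: run HiveQL against the existing Hive metastore from Spark.
object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
    val hiveCtx = new HiveContext(sc)
    // Placeholder table "logs"; the query runs on Spark's engine, not MapReduce.
    val top = hiveCtx.sql(
      "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC LIMIT 10")
    top.collect().foreach(println)
    sc.stop()
  }
}
```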
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
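The receiver-based integration described in the guide can be sketched as follows, assuming Spark 1.2-era APIs; the ZooKeeper address, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: consume a Kafka topic as a DStream and count messages per batch.
object KafkaExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-example")
    val ssc  = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches
    // (topic -> number of receiver threads); placeholders throughout
    val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
    messages.map(_._2).count().print()  // _._2 is the message payload
    ssc.start()
    ssc.awaitTermination()
  }
}
```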
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
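Schema inference over JSON can be sketched with the 1.2-era SchemaRDD API; the file path and field names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: point Spark SQL at JSON files and query -- no DDL needed.
object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc     = new SparkContext(new SparkConf().setAppName("json-example"))
    val sqlCtx = new SQLContext(sc)
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // schema inferred automatically
    people.printSchema()                                      // show the inferred schema
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
    sc.stop()
  }
}
```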
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
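A Parquet round trip can be sketched with the same 1.2-era API; the case class, paths and values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class defined at top level so Spark SQL can reflect over it.
case class Person(name: String, age: Int)

// Sketch: write an RDD of case classes to Parquet, read it back, and query it.
object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc     = new SparkContext(new SparkConf().setAppName("parquet-example"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.createSchemaRDD  // implicit RDD[Person] -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("ann", 30), Person("bob", 25)))
    people.saveAsParquetFile("hdfs:///data/people.parquet")

    val loaded = sqlCtx.parquetFile("hdfs:///data/people.parquet")
    loaded.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age >= 30").collect().foreach(println)
    sc.stop()
  }
}
```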
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, with more integration on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/
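Because Tachyon is Hadoop-compatible, pointing Spark at it is mostly a matter of URIs and storage levels. A minimal sketch, assuming a 1.2-era Spark with a Tachyon master at a placeholder address:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: read from a tachyon:// path and persist an RDD off-heap in Tachyon.
object TachyonExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tachyon-example")
      .set("spark.tachyonStore.url", "tachyon://master:19998")  // placeholder master
    val sc = new SparkContext(conf)

    val data = sc.textFile("tachyon://master:19998/data/input.txt")
    data.persist(StorageLevel.OFF_HEAP)  // blocks live in Tachyon, outside the JVM heap
    println("lines: " + data.count())
    sc.stop()
  }
}
```

Storing RDD blocks off-heap this way is what lets multiple Spark applications share cached data, as described above.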
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
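The conciseness claim is easiest to see in the canonical word count, which in the native Scala API fits in a few lines (the paths are placeholders), versus dozens of lines in classic MapReduce Java:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: word count in the native Scala API.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))      // split lines into words
      .map(word => (word, 1))        // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum counts per word
    counts.saveAsTextFile("hdfs:///data/counts")
    sc.stop()
  }
}
```

The Java 8 lambda version of the same pipeline is nearly line-for-line equivalent, which is the point the slide makes.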
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading. http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
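Two ideas behind RDDs are worth spelling out: transformations are lazy (they only record a lineage, which is evaluated when an action runs) and the result can be kept in memory for reuse across actions. Here is a toy plain-Python sketch of that behavior (this is not the real Spark API, just an illustration of the execution model):

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazy transformations, memoized action."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []      # lineage: recorded transformations
        self._result = None        # filled on first action (the "cache")

    def map(self, f):              # lazy: returns a new "RDD", computes nothing
        return ToyRDD(self._data, self._ops + [lambda xs: [f(x) for x in xs]])

    def filter(self, p):
        return ToyRDD(self._data, self._ops + [lambda xs: [x for x in xs if p(x)]])

    def collect(self):             # action: evaluates the lineage once
        if self._result is None:
            out = list(self._data)
            for op in self._ops:
                out = op(out)
            self._result = out
        return self._result

squares = ToyRDD(range(1, 6)).map(lambda x: x * x)
odds = squares.filter(lambda x: x % 2 == 1)
assert odds.collect() == [1, 9, 25]
```

Real RDDs add partitioning, fault recovery by replaying the lineage, and explicit `cache()`/`persist()` control, but the lazy-pipeline shape is the same.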
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
License: Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model: On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near-real-time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java, [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility: Same data types and data sources | Same data types and data sources | Same data types and data sources
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
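To make the migration path concrete, here is a minimal plain-Python sketch (not Spark API code; the sample lines are invented) contrasting the two styles: an explicit mapper/reducer pair with a simulated shuffle, versus the chained-transformation style that Spark's RDD API encourages for the same word count.

```python
from collections import Counter

# MapReduce style: explicit map phase, shuffle, and reduce phase.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

lines = ["spark and hadoop", "spark or hadoop"]

shuffled = {}                      # shuffle: group values by key
for line in lines:
    for word, one in mapper(line):
        shuffled.setdefault(word, []).append(one)
mr_result = dict(reducer(w, cs) for w, cs in shuffled.items())

# Spark style: one chained pipeline. In PySpark this would be roughly
#   sc.textFile(...).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)
spark_style = Counter(w for line in lines for w in line.split())

assert mr_result == dict(spark_style)
```

The reusable pieces are the `mapper` and `reducer` functions themselves; what changes is the framework plumbing around them.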
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
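As a sketch of what the switch looks like in a Hive session (the `sales` table is hypothetical, and the setting requires a Hive build with the Spark engine, which was still in beta at the time):

```sql
-- Run an existing Hive query on the Spark engine instead of MapReduce/Tez;
-- only the first line changes, the query itself is untouched.
set hive.execution.engine=spark;

SELECT category, COUNT(*) AS cnt
FROM sales
GROUP BY category;
```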
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and the open source tools that integrate with Spark:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
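What "integrates natively" buys you is Spark Streaming's micro-batch model: a continuous feed such as a Kafka topic is chopped into small batches, and each batch is processed with ordinary Spark-style batch logic. A toy plain-Python sketch of that model (the event names are invented; this is not the KafkaUtils API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) stream into fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Pretend this is an incoming Kafka feed of user events.
events = ["click", "view", "click", "buy", "view"]

# Per-batch computation, as Spark Streaming would run on each DStream batch.
clicks_per_batch = [sum(1 for e in b if e == "click")
                    for b in micro_batches(events, 2)]
assert clicks_per_batch == [1, 1, 0]
```

In real Spark Streaming the batch boundary is a time interval (e.g. 1 second) rather than a count, but the processing model is the same.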
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
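Conceptually, schema inference scans the JSON records and unions the fields it sees into one schema. A toy plain-Python illustration of the idea (not the actual Spark SQL implementation; the records are made up):

```python
import json

# Two JSON records with slightly different fields, as in a real log.
records = [
    '{"name": "Alice", "age": 30}',
    '{"name": "Bob", "age": 25, "city": "LA"}',
]

# Union the fields across all records into field -> type-name.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

assert schema == {"name": "str", "age": "int", "city": "str"}
```

Spark SQL does this at scale (and handles nested structures, arrays, and type widening), which is why no DDL is needed before querying.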
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
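The payoff of a columnar format is column pruning: a query that touches one field reads only that field's values instead of whole rows. A toy plain-Python illustration of row versus columnar layout (the data is invented):

```python
# Row layout: records stored one after another; a scan of "amount"
# still visits every field of every record.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
]
row_scan = [r["amount"] for r in rows]

# Columnar layout (Parquet-style): each field's values are stored
# contiguously, so one column can be read in isolation.
columns = {field: [r[field] for r in rows] for field in rows[0]}
col_scan = columns["amount"]

assert row_scan == col_scan == [10.0, 20.0]
```

On disk, Parquet adds per-column encoding and compression on top of this layout, which is what makes analytical scans cheap.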
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity: Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
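The choice of cluster manager is just a `--master` URL at submit time. Illustrative spark-submit invocations for some of these modes (the host names, ports and application jar below are placeholders, not real endpoints):

```shell
# Local mode: driver and executors in a single JVM, 4 worker threads.
spark-submit --master local[4] --class com.example.MyApp my-app.jar

# Standalone mode: point at a Spark master (hypothetical host/port).
spark-submit --master spark://master-host:7077 --class com.example.MyApp my-app.jar

# Mesos mode (hypothetical Mesos master address).
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp my-app.jar
```

The application code itself is unchanged across all three; only the master URL differs.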
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform, Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop Ecosystem | Spark Ecosystem
File System      | HDFS             | Tachyon
Cluster Manager  | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: Beware of <Hadoop vendor>-tinted goggles! FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• MapReduce – 1st generation: batch
• Tez – 2nd generation: batch, interactive
• Spark – 3rd generation: batch, interactive, near-real time
• Flink – 4th generation: batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: for queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                            | Tez                                  | Spark
License          | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory, on-disk; batch, interactive, streaming (near real-time)
Written in       | Java                                        | Java                                 | Scala
API              | [Java, Python, Scala], user-facing          | Java, [ISV/Engine/Tool builder]      | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                                                               | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same for data types and data sources                                           | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                               | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services integrated with Spark, each with its open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
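The Data << RAM point can be felt in a toy sketch (plain Python, not actual Spark or Tez code): caching the parsed dataset in memory means the expensive parse happens only once, while a disk-oriented engine re-parses the data on every pass.

```python
# Toy illustration of why caching parsed data helps when it fits in memory.
parse_calls = 0

def parse(line):
    """Pretend this is an expensive parse step."""
    global parse_calls
    parse_calls += 1
    return int(line)

raw_lines = ["1", "2", "3", "4"]

# Without caching: every pass over the data re-parses it.
total = sum(parse(l) for l in raw_lines)      # pass 1
maximum = max(parse(l) for l in raw_lines)    # pass 2
calls_without_cache = parse_calls             # 2 passes x 4 lines = 8 parses

# With caching (the idea behind Spark's rdd.cache()): parse once, reuse.
parse_calls = 0
cached = [parse(l) for l in raw_lines]        # materialize once in memory
total = sum(cached)                           # pass 1, no re-parse
maximum = max(cached)                         # pass 2, no re-parse
calls_with_cache = parse_calls                # 4 parses total

print(calls_without_cache, calls_with_cache)  # 8 4
```

The counts make the trade-off concrete: each extra pass over an uncached dataset repeats the parsing work, which is exactly what in-memory caching avoids.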
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component         Hadoop ecosystem   Spark ecosystem
File system       HDFS               Tachyon
Resource manager  YARN               Mesos
Tools             Pig                Spark native API
                  Hive               Spark SQL
                  Mahout             MLlib
                  Storm              Spark Streaming
                  Giraph             GraphX
                  HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria          YARN                            Mesos
Resource sharing  Yes                             Yes
Written in        Java                            C++
Scheduling        Memory only                     CPU and memory
Running tasks     Unix processes                  Linux container groups
Requests          Specific requests and           More generic, but more coding
                  locality preference             for writing frameworks
Maturity          Less mature                     Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                     Storm                     Spark Streaming
Processing model             Record at a time          Mini batches
Latency                      Sub-second                Few seconds
Fault tolerance (every       At least once (may be     Exactly once
record processed)            duplicates)
Batch framework integration  Not available             Core Spark API
Supported languages          Any programming language  Scala, Java, Python
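The "mini batches" model in the table can be sketched in plain Python (a conceptual illustration, not Spark Streaming code): records arriving on a stream are grouped into small batches, and each batch is then processed with ordinary batch logic.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an incoming record stream into mini batches, analogous to
    how Spark Streaming discretizes a stream into small RDD batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

records = [1, 2, 3, 4, 5, 6, 7]

# Record-at-a-time (Storm-style): process each record as it arrives.
per_record = [r * 2 for r in records]

# Mini-batch (Spark Streaming-style): the same logic, batch by batch.
batched = [[r * 2 for r in batch] for batch in micro_batches(records, 3)]

print(batched)  # [[2, 4, 6], [8, 10, 12], [14]]
```

Both paths compute the same results; the difference the table captures is latency (a record waits for its batch to fill) versus the simplicity of reusing batch logic and the Core Spark API.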
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
  • With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
  • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.

MapReduce (1st generation):  Batch
Tez (2nd generation):        Batch, Interactive
Spark (3rd generation):      Batch, Interactive, Near-Real-time
Flink (4th generation):      Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
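The RDD idea above can be sketched in a few lines of plain Python (the `MiniRDD` name and implementation are made up for illustration; this is not Spark's actual API): transformations are recorded lazily as a lineage, and only an action triggers computation, which is what lets Spark pipeline work in memory.

```python
class MiniRDD:
    """Toy stand-in for a resilient distributed dataset: transformations
    (map, filter) are lazy; an action (collect) triggers computation."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # recorded lineage of transformations

    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        out = list(self._data)
        for kind, fn in self._ops:     # replay the lineage only now
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

# Nothing is computed until collect() is called.
rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

The recorded lineage is also the seed of the "resilient" part: a lost partition can be recomputed by replaying the transformations from the source data.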
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                   Tez                        Spark
License           Open source, Apache 2.0,    Open source, Apache 2.0,   Open source, Apache 2.0,
                  version 2.x                 version 0.x                version 1.x
Processing model  On-disk (disk-based         On-disk; batch,            In-memory and on-disk; batch,
                  parallelization); batch     interactive                interactive, streaming
                                                                         (near real-time)
Written in        Java                        Java                       Scala
API               [Java, Python, Scala],      Java, [ISV/engine/tool     [Scala, Java, Python],
                  user-facing                 builder]                   user-facing
Libraries         None, separate tools        None                       Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                     Tez                        Spark
Installation      Bound to Hadoop               Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program, needs   Difficult to program;      Easy to program, no need
                  abstractions; no interactive  no interactive mode        of abstractions;
                  mode except Hive, Pig         except Hive, Pig           interactive mode
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application              Ground-up YARN             Spark is moving
                                                application                towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria     MapReduce               Tez                     Spark
Deployment   YARN                    YARN                    [Standalone, YARN, SIMR, Mesos, …]
Performance  –                       –                       Good performance when data fits into
                                                             memory; performance degradation otherwise
Security     More features and       More features and       Still in its infancy; partial support
             projects                projects
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
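Point 1 can be sketched in plain Python (a conceptual stand-in for calling existing Java/Scala mapper and reducer functions from Spark; the driver code here is illustrative, not Spark's API): the per-record functions are reused unchanged inside flatMap/groupByKey-style calls.

```python
from collections import defaultdict

# Existing MapReduce-era functions, reused unchanged in the new driver.
def mapper(record):
    """Classic word-count map: emit (word, 1) pairs."""
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    """Classic word-count reduce: sum the counts per key."""
    return (key, sum(values))

# Spark-style driver: same functions, called from a short pipeline.
records = ["spark with hadoop", "spark without hadoop"]
pairs = [kv for rec in records for kv in mapper(rec)]   # ~ flatMap(mapper)
grouped = defaultdict(list)
for k, v in pairs:                                      # ~ groupByKey()
    grouped[k].append(v)
counts = dict(reducer(k, vs) for k, vs in grouped.items())

print(counts["spark"], counts["hadoop"])  # 2 2
```

The migration cost is confined to the driver: the map and reduce logic, often the bulk of a tested MapReduce job, carries over as-is.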
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "–x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
bull Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support/
bull Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
bull The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
bull Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
bull Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
bull Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
bull Integration of Mahout and Spark:
bull Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
bull Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
bull Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration: Service / Open Source Tool
bull Storage/Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
bull Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
bull Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
bull Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
bull The Cassandra storage backend with Spark is opening many new avenues
bull Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
bull MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
bull MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
bull There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
bull Using MongoDB with Hadoop & Spark:
bull Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
bull Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
bull Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
bull Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
bull Neo4j is a highly scalable, robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
bull Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
bull Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
bull YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the first resource negotiator)
bull Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
bull Some issues are critical ones
bull Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables:
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
bull Drill and Spark integration is work in progress in 2015, to address new use cases:
bull Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
bull Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
bull Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
bull 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
bull Spark Streaming integrates natively with Flume. There are two approaches to this:
bull Approach 1: Flume-style Push-based Approach
bull Approach 2 (Experimental): Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
bull Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL, February 2 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
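The schema inference described above can be sketched in plain Python: a toy, stdlib-only illustration of the idea behind Spark SQL's JSON loading (real Spark SQL also handles nested structures and type widening):

```python
import json

def infer_schema(json_lines):
    # Union field -> type across records, the way Spark SQL derives a
    # schema by scanning a JSON dataset. Toy version: flat records only,
    # no nested structs, no type widening.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',  # a new field simply widens the schema
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```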
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
bull This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
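Why a columnar layout matters can be shown with a stdlib-only Python sketch; this illustrates only the row-vs-column idea, not the actual Parquet encoding, which adds row groups, per-column encodings and compression:

```python
# Row layout vs column layout: the reason a columnar format such as
# Parquet lets a query touch only the columns it needs.
rows = [
    {"user": "alice", "age": 34, "country": "US"},
    {"user": "bob",   "age": 28, "country": "FR"},
    {"user": "carol", "age": 41, "country": "US"},
]

# "Write" to columnar form: one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A scan like SELECT avg(age) now reads a single column, not every row.
avg_age = sum(columns["age"]) / len(columns["age"])
print(round(avg_age, 2))  # 34.33
```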
58
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
bull Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
bull Problem:
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
bull Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
bull Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
bull A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter: http://vimeo.com/83192197
bull Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity +
bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
bull "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + : References
bull Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
bull Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
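The Data << RAM point, parse once and reuse from memory, can be illustrated with a small stdlib-only Python sketch, where plain lists stand in for RDDs and rdd.cache():

```python
# Data << RAM: Spark wins by parsing once, caching the result in memory,
# and reusing it across jobs; re-reading and re-parsing the input for
# every job is the MapReduce-style cost.
raw = ["1,2", "3,4", "5,6"]  # pretend this is a file on disk

def parse(lines):
    return [tuple(int(x) for x in line.split(",")) for line in lines]

# Without caching: each "job" re-parses the raw input.
job1 = sum(a for a, b in parse(raw))
job2 = sum(b for a, b in parse(raw))

# With caching: parse once, keep the parsed data in memory, reuse it.
cached = parse(raw)
assert job1 == sum(a for a, b in cached)  # 9, computed with one parse
assert job2 == sum(b for a, b in cached)  # 12, no re-parse needed
```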
70
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
bull Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13 2014 with Matt Schumpert, Director of Product Management at Datameer)
bull The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th 2015 at the Los Angeles Big Data Users Group
bull http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms, February 23 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
bull Framework for the Future of Hadoop, March 9 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS: https://www.quantcast.com/engineering/qfs
bull ...
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
bull Using Spark on a Non-Hadoop distribution
80
Databricks Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud: From raw data to insights and data products in an instant, March 4 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
bull Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
bull DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
bull Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
bull 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
bull xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
bull xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
bull Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
bull The Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Component: Hadoop Ecosystem / Spark Ecosystem
bull HDFS / Tachyon
bull YARN / Mesos
Tools: Hadoop Ecosystem / Spark Ecosystem
bull Pig / Spark native API
bull Hive / Spark SQL
bull Mahout / MLlib
bull Storm / Spark Streaming
bull Giraph / GraphX
bull HUE / Spark Notebook, ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as Data Center "OS":
bull Share the datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
bull 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria: YARN / Mesos
bull Resource sharing: Yes / Yes
bull Written in: Java / C++
bull Scheduling: Memory only / CPU and Memory
bull Running tasks: Unix processes / Linux Container groups
bull Requests: Specific requests and locality preference / More generic, but more coding for writing frameworks
bull Maturity: Less mature / Relatively more mature
91
Spark Native API
bull Spark Native API in Scala, Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for much more concise lambda expressions, getting code nearly as simple as with the Scala API
bull ETL with Spark - First Spark London Meetup, May 28 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
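The shape of the lambda-based Spark API can be sketched over a local Python list (stdlib only; the flatMap/map/reduceByKey names in the comments refer to the Spark operations this mirrors):

```python
from collections import Counter
from functools import reduce

# The classic Spark word count
#   sc.textFile(...).flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)
# expressed over a local list, to show the shape of the lambda-based API.
lines = ["to be or not to be", "to spark"]

flat_mapped = [w for line in lines for w in line.split()]  # flatMap
mapped = [(w, 1) for w in flat_mapped]                     # map
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                mapped, Counter())                         # reduceByKey
print(counts["to"], counts["be"])  # 3 2
```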
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
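The "mix SQL with imperative code" workflow can be sketched with the stdlib sqlite3 module standing in for the SQL engine; this is an analogy only, since Spark SQL runs distributed and reads sources such as Hive, JSON and Parquet:

```python
import sqlite3

# Declarative SQL and imperative code over the same data, in miniature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"), ("INFO", "ok"), ("ERROR", "timeout")])

# Declarative step: let the engine filter...
errors = conn.execute("SELECT msg FROM logs WHERE level = 'ERROR'").fetchall()

# ...then continue imperatively on the result set.
shouting = [msg.upper() for (msg,) in errors]
print(shouting)  # ['DISK FULL', 'TIMEOUT']
```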
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria: Storm / Spark Streaming
bull Processing model: Record at a time / Mini batches
bull Latency: Sub-second / Few seconds
bull Fault tolerance (every record processed): At least once (may be duplicates) / Exactly once
bull Batch framework integration: Not available / Core Spark API
bull Supported languages: Any programming language / Scala, Java, Python
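The record-at-a-time vs mini-batch distinction in the table can be sketched in stdlib-only Python; this is a toy batching function, whereas Spark Streaming's actual DStreams batch by wall-clock time:

```python
# Spark Streaming slices the incoming stream into small time windows and
# runs ordinary batch code on each slice, which is why its latency is
# "a few seconds" rather than sub-second like record-at-a-time Storm.
stream = [(0.2, 5), (0.7, 3), (1.1, 7), (1.9, 2)]  # (arrival time, value)

def to_batches(records, window=1.0):
    batches = {}
    for t, value in records:
        batches.setdefault(int(t // window), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Each mini-batch is processed with the same code a batch job would use.
print([sum(batch) for batch in to_batches(stream)])  # [8, 9]
```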
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
bull Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
bull ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3 Vendors
bull Spark and Hadoop: Working Together, January 21 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
bull "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
bull "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3 Vendors
bull "Spark is already an excellent piece of software and is advancing very quickly. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30 2013: http://vision.cloudera.com/mapreduce-spark
bull "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
bull "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
bull MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
bull "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
bull Hortonworks: A shared vision for Apache Spark on Hadoop, October 21 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
bull "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the definition above is itself becoming inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
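For contrast with the Java MapReduce WordCount linked above, here is a hedged sketch of the same job in Spark's Scala API; the input and output paths are placeholders and a Spark 1.x cluster is assumed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/docs")  // any Hadoop-supported path
      .flatMap(_.split("\\s+"))                     // tokenize each line
      .map(word => (word, 1))                       // pair each word with a count of 1
      .reduceByKey(_ + _)                           // sum the counts per word
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

The whole multi-class MapReduce program collapses to a few chained transformations, which is the point of the "assembly code" comparison.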
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License:
  - MapReduce: Open Source, Apache 2.0, version 2.x
  - Tez: Open Source, Apache 2.0, version 0.x
  - Spark: Open Source, Apache 2.0, version 1.x
• Processing Model:
  - MapReduce: On-Disk (disk-based parallelization); Batch
  - Tez: On-Disk; Batch, Interactive
  - Spark: In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time)
• Language written in:
  - MapReduce: Java
  - Tez: Java
  - Spark: Scala
• API:
  - MapReduce: [Java, Python, Scala], user-facing
  - Tez: Java [ISV/Engine/Tool builder]
  - Spark: [Scala, Java, Python], user-facing
• Libraries:
  - MapReduce: None, separate tools
  - Tez: None
  - Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
• Installation:
  - MapReduce: Bound to Hadoop
  - Tez: Bound to Hadoop
  - Spark: Isn't bound to Hadoop
• Ease of Use:
  - MapReduce: Difficult to program, needs abstractions; no interactive mode except via Hive, Pig
  - Tez: Difficult to program; no interactive mode except via Hive, Pig
  - Spark: Easy to program, no need for abstractions; interactive mode
• Compatibility:
  - MapReduce, Tez and Spark: compatibility to data types and data sources is the same
• YARN integration:
  - MapReduce: YARN application
  - Tez: Ground-up YARN application
  - Spark: Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
• Deployment:
  - MapReduce: YARN
  - Tez: YARN
  - Spark: [Standalone, YARN, SIMR, Mesos, …]
• Performance:
  - Spark: good performance when data fits into memory; performance degradation otherwise
• Security:
  - MapReduce: more features and projects
  - Tez: more features and projects
  - Spark: still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open as of Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories and open source tools (shown as logos on the slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
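A hedged sketch of the newAPIHadoopRDD approach, along the lines of the HBaseTest.scala example linked above; the table name is a placeholder and the HBase client jars are assumed to be on the classpath:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

    // Expose the HBase table as an RDD of (row key, row) pairs
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"Rows in table: ${rdd.count()}")
    sc.stop()
  }
}
```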
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
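A minimal sketch of the Spark Cassandra Connector, assuming the connector jar is on the classpath; the host, keyspace, table and column names are all placeholders:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and filter it with plain Spark code
    val users = sc.cassandraTable("test_keyspace", "users")
    val actives = users.filter(_.getBoolean("active"))

    // Write a Spark RDD back to another Cassandra table
    actives.map(row => (row.getString("id"), row.getString("email")))
      .saveToCassandra("test_keyspace", "active_users", SomeColumns("id", "email"))
    sc.stop()
  }
}
```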
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
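A sketch of querying an existing Hive table from Spark SQL, using the Spark 1.2-era HiveContext API; the table and column names are placeholders, and a Hive metastore is assumed to be configured:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
    val hiveCtx = new HiveContext(sc)  // reads hive-site.xml for metastore config

    // Run a HiveQL query against an existing Hive table (placeholder name)
    val rows = hiveCtx.sql("SELECT category, price FROM sales WHERE price > 100")
    rows.collect().foreach(println)
    sc.stop()
  }
}
```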
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
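A hedged sketch of the receiver-based Kafka stream from the integration guide above (Spark 1.2-era API); the ZooKeeper quorum, consumer group and topic name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> threads)
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group",
      Map("events" -> 1)).map(_._2)  // keep only the message payload

    // Running word count over each micro-batch
    lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```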
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
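A sketch of the schema inference described above, using the Spark 1.2-era SQLContext API; the input path and the queried field names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
    val sqlCtx = new SQLContext(sc)

    // Infer the schema directly from the JSON files -- no DDL needed
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // placeholder path
    people.printSchema()             // show the inferred schema
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)
    sc.stop()
  }
}
```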
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
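A sketch of the round trip through Parquet with the Spark 1.2-era SchemaRDD API; the paths and table name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
    val sqlCtx = new SQLContext(sc)

    // Write a SchemaRDD out to Parquet, then read it back and query it
    val people = sqlCtx.jsonFile("hdfs:///data/people.json")  // placeholder input
    people.saveAsParquetFile("hdfs:///data/people.parquet")

    val parquet = sqlCtx.parquetFile("hdfs:///data/people.parquet")
    parquet.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people").collect().foreach(println)
    sc.stop()
  }
}
```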
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
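A sketch of the elasticsearch-hadoop RDD integration described above, assuming the elasticsearch-spark jar is on the classpath; the node address and the "index/type" string are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

object EsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("EsExample")
      .set("es.nodes", "localhost")  // placeholder Elasticsearch node
    val sc = new SparkContext(conf)

    // Index a small RDD of documents into Elasticsearch
    val docs = Seq(Map("title" -> "spark",  "views" -> 10),
                   Map("title" -> "hadoop", "views" -> 7))
    sc.makeRDD(docs).saveToEs("demo/articles")  // placeholder index/type

    // Read the documents back as an RDD of (id, document) pairs
    sc.esRDD("demo/articles").collect().foreach(println)
    sc.stop()
  }
}
```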
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• Hue is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem (logos shown on the slide)
65
4. Complementarity: Tachyon + Spark + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: When processing huge data volumes, much bigger than the cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
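As a sketch of how the deployment choice surfaces in practice, the cluster manager is selected with the --master flag of spark-submit (host names, ports, class, and jar names below are placeholders, not from the original deck):

```shell
# Local mode: run on a single machine with 4 worker threads
spark-submit --master "local[4]" --class com.example.MyApp myapp.jar

# Standalone: Spark's own built-in cluster manager
spark-submit --master spark://master-host:7077 --class com.example.MyApp myapp.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 --class com.example.MyApp myapp.jar

# Hadoop YARN (the only option that actually requires a Hadoop cluster; Spark 1.x syntax)
spark-submit --master yarn-cluster --class com.example.MyApp myapp.jar
```

The application code itself is unchanged across all four; only the master URL differs.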
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component   Hadoop Ecosystem   Spark Ecosystem
Storage     HDFS               Tachyon
Resources   YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
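Because Tachyon exposes a Hadoop-compatible file system API, "without any code change" means only the URI scheme differs. A minimal sketch, assuming an existing SparkContext sc and a Tachyon master on its default port (host names and paths are hypothetical):

```scala
// Read from and write to Tachyon exactly as with HDFS; only the scheme changes
val input = sc.textFile("tachyon://tachyon-master:19998/logs/input.txt")
val errors = input.filter(_.contains("ERROR"))
errors.saveAsTextFile("tachyon://tachyon-master:19998/logs/errors")

// Spark 1.x can also keep checkpoints off-heap in Tachyon
sc.setCheckpointDir("tachyon://tachyon-master:19998/checkpoints")
```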
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and Memory
Running tasks     Unix processes                   Linux Container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making the code much more concise - nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
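To illustrate how concise the native API is, here is the canonical word count in the Scala API, with the roughly equivalent Java 8 lambda version in comments. This is an illustrative sketch assuming an existing SparkContext sc; paths are placeholders:

```scala
// Scala API: word count in three transformations
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/counts")

// The Java 8 lambda version (Spark 1.x Java API) is nearly as short:
// JavaPairRDD<String, Integer> counts = jsc.textFile("hdfs:///data/input.txt")
//     .flatMap(line -> Arrays.asList(line.split(" ")))
//     .mapToPair(w -> new Tuple2<>(w, 1))
//     .reduceByKey((a, b) -> a + b);
```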
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                       Storm                      Spark Streaming
Processing model               Record at a time           Mini batches
Latency                        Sub-second                 Few seconds
Fault tolerance (every         At least once (may be      Exactly once
record processed)              duplicates)
Batch framework integration    Not available              Core Spark API
Supported languages            Any programming language   Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
7
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
8
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, by Brian Hopkins, November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name/ - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list - stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org/
• Hive: http://hive.apache.org/
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org/
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org/
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

Generation   Engine       Workloads
1st          MapReduce    Batch
2nd          Tez          Batch, Interactive
3rd          Spark        Batch, Interactive, Near-Real-Time
4th          Flink        Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org/
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org/
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark': for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org/
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org/) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                  Tez                       Spark
License           Open Source Apache 2.0,    Open Source Apache 2.0,   Open Source Apache 2.0,
                  version 2.x                version 0.x               version 1.x
Processing model  On-disk (disk-based        On-disk; Batch,           In-memory and on-disk; Batch,
                  parallelization); Batch    Interactive               Interactive, Streaming
                                                                       (near real-time)
Written in        Java                       Java                      Scala
API               [Java, Python, Scala],     Java, [ISV/Engine/Tool    [Scala, Java, Python],
                  user-facing                builder]                  user-facing
Libraries         None, separate tools       None                      [Spark Core, Spark Streaming,
                                                                       Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                   Tez                         Spark
Installation      Bound to Hadoop             Bound to Hadoop             Isn't bound to Hadoop
Ease of use       Difficult to program,       Difficult to program;       Easy to program, no need
                  needs abstractions; no      no interactive mode         of abstractions;
                  interactive mode (except    (except Hive, Pig)          interactive mode
                  Hive, Pig)
Compatibility     Compatibility to data types and data sources is the same for all three
YARN integration  YARN application            Ground-up YARN              Spark is moving
                                              application                 towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria     MapReduce                Tez                      Spark
Deployment   YARN                     YARN                     [Standalone, YARN, SIMR, Mesos, …]
Performance  -                        -                        Good performance when data fits into
                                                               memory; performance degradation otherwise
Security     More features and        More features and        Still in its infancy; partial support
             projects                 projects
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
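As a sketch of such reuse (the function names and tab-separated record format below are hypothetical, assuming an existing SparkContext sc): the body of a Mapper.map() becomes a function passed to flatMap or map, and the body of a Reducer.reduce() becomes the function passed to reduceByKey.

```scala
// Former mapper body, factored as a plain function
def parseRecord(line: String): Option[(String, Long)] =
  line.split("\t") match {
    case Array(key, value) => Some((key, value.toLong))
    case _                 => None  // skip malformed records
  }

// Former reducer body
def sum(a: Long, b: Long): Long = a + b

val result = sc.textFile("hdfs:///data/records")
  .flatMap(parseRecord)   // map phase
  .reduceByKey(sum)       // shuffle + reduce phase
result.saveAsTextFile("hdfs:///data/sums")
```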
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
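In practice the switch is a single flag on the Pig command line (the script name below is a placeholder):

```shell
# Run an existing Pig script unchanged, with Spark as the execution engine
pig -x spark myscript.pig

# The same script on MapReduce or Tez, for comparison
pig -x mapreduce myscript.pig
pig -x tez myscript.pig
```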
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
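Switching an existing Hive session to the Spark engine is a one-line setting; a sketch of a session (the table and query are hypothetical):

```sql
-- Select Spark as the execution engine for this session
set hive.execution.engine=spark;

-- Existing HiveQL runs unchanged; multi-stage queries benefit most
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

-- Switch back to MapReduce (or tez) at any time
set hive.execution.engine=mr;
```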
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across these service categories:
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
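Since all of these storage systems are reached through the same Hadoop file system API, a Spark program only varies the URI scheme; a sketch assuming an existing SparkContext sc (hosts, buckets, and paths are placeholders):

```scala
// Same API, different storage back ends
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
val fromLocal = sc.textFile("file:///tmp/events.log")
val fromS3    = sc.textFile("s3n://my-bucket/events.log")  // s3n:// scheme in Spark 1.x

// Results can be written back to any of them the same way
fromHdfs.filter(_.nonEmpty).saveAsTextFile("s3n://my-bucket/cleaned")
```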
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
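The newAPIHadoopRDD route looks roughly like the following (modeled on the HBaseTest.scala example above; the table name is a placeholder and sc is an existing SparkContext):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the Hadoop InputFormat at an HBase table
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Each RDD element is a (row key, row result) pair
val hbaseRdd = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows: ${hbaseRdd.count()}")
```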
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
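With the Spark Cassandra Connector, reading and writing look like the sketch below (keyspace, table, and column names are hypothetical; sc is an existing SparkContext configured with spark.cassandra.connection.host):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD, projecting two columns
val users = sc.cassandraTable("my_keyspace", "users")
  .select("user_id", "email")

// Write any RDD of tuples back to a Cassandra table
sc.parallelize(Seq((1, "a@example.com"), (2, "b@example.com")))
  .saveToCassandra("my_keyspace", "users", SomeColumns("user_id", "email"))
```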
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage back end with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
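A minimal sketch of the Hive round trip from Spark SQL (Spark 1.2-era API; the table and column names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads the existing Hive metastore
val hiveContext = new HiveContext(sc)

// Query existing Hive tables with HiveQL; the result is an RDD of Rows
val topDepts = hiveContext.sql(
  "SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept")

// Mix SQL with ordinary RDD transformations
topDepts.map(row => (row.getString(0), row.getLong(1)))
  .filter(_._2 > 100)
  .collect()
  .foreach(println)
```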
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
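The receiver-based integration from the guide above looks roughly like this sketch (Spark 1.x API; the ZooKeeper host, consumer group, and topic names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// A 10-second micro-batch stream reading from Kafka via ZooKeeper
val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc,
    "zk-host:2181",          // ZooKeeper quorum
    "my-consumer-group",     // consumer group id
    Map("events" -> 1))      // topic -> number of receiver threads
  .map(_._2)                 // keep the message value, drop the key

lines.count().print()
ssc.start()
ssc.awaitTermination()
```

Spark 1.3 also introduced an experimental receiver-less "direct" approach (KafkaUtils.createDirectStream) for exactly-once semantics.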
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
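A minimal sketch of the schema-inference workflow (Spark 1.2-era API; the path and field names are hypothetical, and sc is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records; no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()
people.registerTempTable("people")

// Query the inferred schema with plain SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```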
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3 Integration
• spark-avro is a library for querying Avro data with Spark SQL. It requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than having to choose one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
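The "Data << RAM" advantage is simple to demonstrate outside Spark. This plain-Python sketch (illustrative only, not Spark's implementation) counts how many times records get re-parsed with and without an in-memory cache of the parsed results, which is essentially what caching an RDD avoids:

```python
parse_count = 0

def parse(line):
    """Pretend this is an expensive parse step (e.g. JSON decoding)."""
    global parse_count
    parse_count += 1
    return int(line)

raw = ["1", "2", "3"]

# Without caching: every pass over the data re-parses it (MapReduce-style).
total = sum(parse(l) for l in raw)
maximum = max(parse(l) for l in raw)
print(parse_count)  # 6 -> each record parsed twice

# With caching: parse once, keep parsed results in memory
# (analogous to rdd.map(parse).cache() in Spark).
parse_count = 0
cached = [parse(l) for l in raw]
total, maximum = sum(cached), max(cached)
print(parse_count)  # 3 -> each record parsed once
```

With data that fits in cluster memory, Spark amortizes the parse/load cost across all subsequent passes, which is why iterative and interactive workloads benefit most.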
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
           | Hadoop ecosystem | Spark ecosystem
Component  | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
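The "record at a time" vs "mini-batch" distinction in the table can be sketched in plain Python (illustrative only, not Storm or Spark Streaming code): a mini-batch engine groups the incoming stream into small windows and processes each window as a batch:

```python
from itertools import islice

def stream():
    """Pretend this is an unbounded stream of incoming events."""
    yield from range(10)

def mini_batches(events, batch_size):
    """Group a stream into mini-batches, as Spark Streaming does with
    small time windows (size-based here for simplicity)."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Record-at-a-time (Storm-style): process each event as it arrives.
record_results = [e * 2 for e in stream()]

# Mini-batch (Spark Streaming-style): process whole small batches at once.
batch_results = [sum(b) for b in mini_batches(stream(), 3)]
print(batch_results)  # [3, 12, 21, 9]
```

The batch granularity is exactly the latency trade-off in the table: per-record processing gives sub-second latency, while batching adds a few seconds but inherits the core batch API and exactly-once semantics.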
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
8
3 Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor -- no new project -- is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
  • With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
  • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: listen to what Spark developers are saying.
3. Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the market dynamics.
15
II Big Data Typical Big Data Stack Hadoop Spark
1 Big Data2 Typical Big Data Stack 3 Apache Hadoop4 Apache Spark5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage!
  • BYOC: Bring Your Own Cluster!
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
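The gap between "assembly code" MapReduce and the higher-level APIs above is easy to feel with a word count. This plain-Python sketch (illustrative, not actual Hadoop or Spark code) contrasts explicit map/shuffle/reduce phases with the one-expression chained style that Pig, Scalding, or Spark's API gives you:

```python
from collections import Counter, defaultdict

lines = ["spark or hadoop", "spark and hadoop"]

# MapReduce-style: explicit map, shuffle, and reduce phases.
mapped = [(word, 1) for line in lines for word in line.split()]   # map
shuffled = defaultdict(list)
for word, one in mapped:                                          # shuffle
    shuffled[word].append(one)
counts_mr = {word: sum(ones) for word, ones in shuffled.items()}  # reduce

# High-level style: a single chained expression over the dataset.
counts_spark = Counter(word for line in lines for word in line.split())

print(counts_mr == counts_spark)  # True
```

Both produce identical counts; the difference is entirely in how much plumbing the programmer has to write, which is the point of the API evolution listed above.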
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
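The RDD execution model can be sketched in miniature. The toy class below (purely illustrative, nothing like Spark's real distributed implementation) shows the two ideas the deck leans on: transformations such as map and filter are lazy and merely recorded, and an action such as collect or count triggers the actual computation over in-memory data:

```python
class ToyRDD:
    """A miniature, single-machine imitation of Spark's RDD API:
    transformations (map/filter) are recorded lazily; actions
    (collect/count) run the recorded pipeline over in-memory data."""

    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = list(ops)

    def map(self, f):
        # Lazy: record the operation, return a new dataset handle.
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):
        # Action: replay the recorded pipeline now.
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())    # 5
```

In real Spark, the recorded lineage is also what makes the datasets "resilient": a lost partition can be recomputed from its lineage instead of being replicated.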
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | MapReduce                            | Tez                                  | Spark
License             | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive   | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                 | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing   | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                 | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                            | Tez                                  | Spark
Installation     | Bound to Hadoop                      | Bound to Hadoop                      | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application                     | Ground-up YARN application           | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
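Point 1 above can be sketched in plain Python (illustrative only; in practice this is done with Spark's Java/Scala API): the existing mapper and reducer functions are kept unchanged, and simply called from functional-style operations analogous to flatMap and reduceByKey:

```python
from itertools import groupby

# Existing MapReduce-style functions, reused unchanged.
def mapper(line):
    """Emit (word, 1) pairs, as a Hadoop Mapper would."""
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    """Sum the counts for one key, as a Hadoop Reducer would."""
    return (key, sum(values))

lines = ["spark with hadoop", "spark without hadoop"]

# Spark-style pipeline reusing mapper/reducer, analogous to:
#   lines.flatMap(mapper).groupByKey().map(reducer)
pairs = [kv for line in lines for kv in mapper(line)]          # flatMap
pairs.sort(key=lambda kv: kv[0])                               # shuffle
counts = [reducer(k, [v for _, v in g])                        # reduce
          for k, g in groupby(pairs, key=lambda kv: kv[0])]
print(dict(counts))  # {'hadoop': 2, 'spark': 2, 'with': 1, 'without': 1}
```

The migration effort is mostly in the driver code, not in the per-record business logic.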
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service / Open Source Tool (tool logos in the original slide):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• Cassandra as a storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
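The automatic schema inference described above can be illustrated with a toy version in plain Python. Spark SQL's real implementation is far more complete (it merges conflicting types, handles nesting and arrays, and runs distributed), so treat this strictly as a sketch of the idea: scan JSON records and accumulate a field-to-type mapping, so no DDL is needed up front.

```python
import json

def infer_field_type(value):
    # Classify a JSON value with a simple type name, the way a
    # schema-inference pass would label each field.
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"

def infer_schema(json_lines):
    # One JSON document per line (the layout Spark SQL's JSON loader
    # expects); the inferred schema is the union of all fields seen.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, infer_field_type(value))
    return schema
```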
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
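The built-in support above means a Parquet file can be queried with nothing but SQL. A hedged sketch using the Spark 1.2-era data sources syntax (the path and table name are placeholders):

```sql
-- Register a Parquet file as a temporary table in Spark SQL
CREATE TEMPORARY TABLE logs
USING org.apache.spark.sql.parquet
OPTIONS (path "hdfs:///data/logs.parquet");

-- Query it like any relational table
SELECT status, COUNT(*) FROM logs GROUP BY status;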
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Tachyon + Spark + Hadoop
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-packhtml.html
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity
3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
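From the application's point of view, the deployment modes above mostly differ in the --master URL handed to spark-submit. A hedged sketch in Spark 1.x-era syntax (hosts, ports, and the application jar are placeholders):

```shell
# Local mode: run with 4 worker threads on one machine
spark-submit --master "local[4]" app.jar

# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.jar

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn-cluster app.jar
```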
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and Memory
• Running tasks: Unix processes | Linux Container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
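The "record at a time" vs "mini batches" distinction in the table can be made concrete with a toy simulation in plain Python: a Storm-style loop hands each record to its handler the moment it arrives, while a Spark Streaming-style loop first collects records into small batches and processes each batch as one unit. This is purely illustrative and is not either system's API.

```python
def record_at_a_time(records, handle):
    # Storm-style: every record is processed individually, as it arrives,
    # which is what gives sub-second latency.
    for record in records:
        handle(record)

def mini_batches(records, batch_size, handle_batch):
    # Spark Streaming-style: records are grouped into micro-batches
    # (in the real system the grouping is by time interval, not count),
    # trading a few seconds of latency for batch-API integration.
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        handle_batch(batch)
```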
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
9
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data": http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
10
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance": https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing": http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits", October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
12
4 Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! – It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles! FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool. The above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete but a useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on the evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in: Java | Java | Scala
API: [Java, Python, Scala] User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python] User-Facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility: Compatibility to data types and data sources is the same | same | same
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
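To make the translation concrete, here is a minimal sketch (Spark 1.x Scala API) of the classic MapReduce word count re-expressed as Spark RDD transformations; the HDFS paths are placeholders, not from the original deck:

```scala
// Minimal sketch: MapReduce word count expressed with Spark RDDs.
// Input/output paths are hypothetical placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountOnSpark"))
    val counts = sc.textFile("hdfs:///input/docs")   // was: the mapper's input split
      .flatMap(_.split("\\s+"))                      // was: map() tokenizing each line
      .map(word => (word, 1))                        // was: map() emitting (word, 1)
      .reduceByKey(_ + _)                            // was: the reducer's sum
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

The shuffle that MapReduce performs between map and reduce happens inside `reduceByKey`; the rest of the pipeline stays in one job.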
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
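As a command-line illustration only (it assumes a Pig build with the Spork work applied; the script name is a placeholder), the "-x spark" switch mentioned above looks like:

```shell
# Run an existing Pig script unchanged, but on the Spark execution engine
pig -x spark wordcount.pig

# Compare with the default MapReduce execution of the same script
pig -x mapreduce wordcount.pig
```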
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading on Spark (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout on Spark (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout on Spark (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and open source tools integrating with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
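The first bullet above is visible directly in the RDD API: the same `textFile` call reads from any Hadoop-supported storage, and only the URI scheme changes. A hedged sketch (Spark 1.x; all hosts, ports and paths are placeholders):

```scala
// Sketch: storage-agnostic reads via Hadoop-API URI schemes.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("StorageAgnostic"))
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log") // HDFS
val fromLocal = sc.textFile("file:///tmp/events.log")               // local FS
val fromS3    = sc.textFile("s3n://my-bucket/events.log")           // S3 (needs AWS credentials)

// Downstream code is identical regardless of where the data came from.
val merged = fromHdfs.union(fromLocal).union(fromS3)
merged.saveAsTextFile("hdfs://namenode:8020/out/merged")
```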
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
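A hedged sketch of the DataStax connector API described in the first bullet (1.x-era calls; the keyspace, table and column names are invented for illustration):

```scala
// Sketch: Cassandra tables as Spark RDDs via spark-cassandra-connector.
// Keyspace/table/column names below are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows
val plays = sc.cassandraTable("music", "plays")
val byArtist = plays
  .map(row => (row.getString("artist"), 1))
  .reduceByKey(_ + _)

// Write the aggregated RDD back to another Cassandra table
byArtist.saveToCassandra("music", "plays_by_artist", SomeColumns("artist", "count"))
```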
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
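The three Hive capabilities listed above can be sketched through the Spark 1.2-era HiveContext (table and column names here are placeholders, not from the deck):

```scala
// Sketch: querying and writing Hive tables from Spark SQL (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveIntegration"))
val hiveCtx = new HiveContext(sc)

// Import + query: run HiveQL over an existing Hive table
val top = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC LIMIT 10")
top.collect().foreach(println)

// Write back: materialize a query result as a new Hive table
hiveCtx.sql("CREATE TABLE top_pages AS SELECT page FROM logs GROUP BY page")
```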
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput, distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
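The native integration mentioned above can be sketched with the receiver-based `KafkaUtils.createStream` API available in Spark 1.x (ZooKeeper host, consumer group and topic name are placeholders):

```scala
// Sketch: consuming a Kafka topic with Spark Streaming (Spark 1.x API).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second micro-batches

// (zkQuorum, consumer group, topics -> number of receiver threads)
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
val words = messages
  .map(_._2)                 // drop the Kafka message key, keep the value
  .flatMap(_.split(" "))
words.map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```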
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
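The "no more DDL" point can be sketched with the Spark 1.2-era `jsonFile` call; the file path and field names are placeholders:

```scala
// Sketch: automatic schema inference over JSON (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
val sqlCtx = new SQLContext(sc)

// Point Spark SQL at JSON files; the schema is inferred, no DDL written
val people = sqlCtx.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register the inferred SchemaRDD and query it with SQL
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```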
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
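The three bullets above round-trip through two calls in the Spark 1.2 API; a hedged sketch with placeholder paths:

```scala
// Sketch: writing and reading Parquet from Spark SQL (Spark 1.2 API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
val sqlCtx = new SQLContext(sc)

// Write: persist a SchemaRDD as columnar Parquet files
val logs = sqlCtx.jsonFile("hdfs:///data/logs.json")
logs.saveAsParquetFile("hdfs:///data/logs.parquet")

// Read: load the Parquet files back and query them with SQL
val reloaded = sqlCtx.parquetFile("hdfs:///data/logs.parquet")
reloaded.registerTempTable("logs")
sqlCtx.sql("SELECT COUNT(*) FROM logs").collect().foreach(println)
```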
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time, distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
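The elasticsearch-hadoop native integration described above can be sketched as follows (hedged: the node address, index/type name and document fields are invented for illustration):

```scala
// Sketch: RDD <-> Elasticsearch via elasticsearch-hadoop's Spark support.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs and esRDD to SparkContext

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")  // hypothetical ES node
val sc = new SparkContext(conf)

// Save: any RDD whose elements translate into documents can be indexed
val docs = sc.makeRDD(Seq(Map("title" -> "Spark", "year" -> 2015)))
docs.saveToEs("talks/slides")       // index/type name is a placeholder

// Read: pull the documents back out of Elasticsearch as an RDD
val back = sc.esRDD("talks/slides")
```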
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: Healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
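From the application's point of view, all of these non-HDFS options surface as URI schemes on the same API. A hedged sketch (hosts, ports and bucket names are placeholders; each scheme needs the matching connector and credentials on the classpath):

```scala
// Sketch: Spark addressing non-HDFS storage purely through URI schemes.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SparkWithoutHdfs"))

val s3Data      = sc.textFile("s3n://my-bucket/input")             // Amazon S3
val tachyonData = sc.textFile("tachyon://master:19998/input")      // Tachyon in-memory FS
val swiftData   = sc.textFile("swift://container.provider/input")  // OpenStack Swift

// Identical transformations regardless of the backing store
val total = s3Data.union(tachyonData).union(swiftData).count()
```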
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
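As a concrete sketch, the deployment mode is selected by the master URL passed at launch. These are illustrative command fragments, not from the talk; host names, ports and key-file names are placeholders:

```
spark-shell --master local[4]                # 1. Local, 4 worker threads
spark-shell --master spark://master:7077     # 2. Standalone cluster
spark-shell --master mesos://master:5050     # 3. Apache Mesos
./ec2/spark-ec2 -k mykey -i mykey.pem launch my-cluster   # 4. EC2 launch script
```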
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
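A minimal sketch of the "no code change" claim, assuming a running Tachyon master and a Spark runtime (host, port and paths below are placeholders): only the URI scheme changes.

```python
# Sketch -- requires a Spark runtime plus a Tachyon deployment; not runnable standalone.
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-sketch")
# Same textFile/saveAsTextFile calls as with hdfs://, only the scheme differs
rdd = sc.textFile("tachyon://tachyon-master:19998/logs/events.txt")
rdd.saveAsTextFile("tachyon://tachyon-master:19998/logs/events-out")
```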
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Resource sharing: YARN - yes; Mesos - yes
• Written in: YARN - Java; Mesos - C++
• Scheduling: YARN - memory only; Mesos - CPU and memory
• Running tasks: YARN - Unix processes; Mesos - Linux container groups
• Requests: YARN - specific requests and locality preference; Mesos - more generic, but more coding for writing frameworks
• Maturity: YARN - less mature; Mesos - relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making the Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
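To illustrate the conciseness of the native API, here is the classic word count as a sketch against the Python API (the input path is a placeholder, and a Spark runtime is required):

```python
# Sketch -- run inside a pyspark shell or with spark-submit; path is a placeholder.
from pyspark import SparkContext

sc = SparkContext("local[2]", "wordcount")
counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.take(5))
```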
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL with sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
• Processing model: Storm - record at a time; Spark Streaming - mini batches
• Latency: Storm - sub-second; Spark Streaming - a few seconds
• Fault tolerance (every record processed): Storm - at least once (may be duplicates); Spark Streaming - exactly once
• Batch framework integration: Storm - not available; Spark Streaming - Core Spark API
• Supported languages: Storm - any programming language; Spark Streaming - Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
10
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
11
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs, Oh My - It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License: MapReduce - Open Source, Apache 2.0, version 2.x; Tez - Open Source, Apache 2.0, version 0.x; Spark - Open Source, Apache 2.0, version 1.x
• Processing model: MapReduce - on-disk (disk-based parallelization), batch; Tez - on-disk, batch, interactive; Spark - in-memory or on-disk, batch, interactive, streaming (near real-time)
• Language written in: MapReduce - Java; Tez - Java; Spark - Scala
• API: MapReduce - [Java, Python, Scala], user-facing; Tez - Java [ISV/engine/tool builder]; Spark - [Scala, Java, Python], user-facing
• Libraries: MapReduce - none, separate tools; Tez - none; Spark - [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
• Installation: MapReduce - bound to Hadoop; Tez - bound to Hadoop; Spark - isn't bound to Hadoop
• Ease of use: MapReduce - difficult to program, needs abstractions, no interactive mode except Hive/Pig; Tez - difficult to program, no interactive mode except Hive/Pig; Spark - easy to program, no need for abstractions, interactive mode
• Compatibility: the same for all three with respect to data types and data sources
• YARN integration: MapReduce - YARN application; Tez - ground-up YARN application; Spark - moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
• Deployment: MapReduce - YARN; Tez - YARN; Spark - [Standalone, YARN, SIMR, Mesos, …]
• Performance: Spark - good performance when data fits into memory, performance degradation otherwise
• Security: MapReduce - more features and projects; Tez - more features and projects; Spark - still in its infancy, partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
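The correspondence behind point 1 can be made concrete with a plain-Python model of the Spark pipeline: the mapper and reducer bodies carry over unchanged, only the glue differs. (This models RDD semantics on a list; in real Spark, `lines` would come from `sc.textFile(...)` and `flat_map`/`reduce_by_key` would be RDD transformations.)

```python
# Plain-Python model of a Spark word count, showing that MapReduce mapper and
# reducer logic can be reused as ordinary functions.

def flat_map(f, xs):
    # models RDD.flatMap on a plain list
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # models RDD.reduceByKey on a plain list of (key, value) pairs
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

# Mapper logic, reused as-is: tokenize and emit (word, 1) pairs
mapper = lambda line: [(w, 1) for w in line.split()]
# Reducer logic, reused as-is: sum counts per key
reducer = lambda a, b: a + b

lines = ["spark and hadoop", "spark with hadoop"]
counts = reduce_by_key(reducer, flat_map(mapper, lines))
print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'with': 1}
```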
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
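A sketch of the migration path just described (the script name is a placeholder): the same Pig script runs unchanged, only the execution-engine flag differs.

```
pig -x mapreduce wordcount.pig   # existing MapReduce execution
pig -x spark     wordcount.pig   # same script on the Spark engine
```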
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
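A sketch of what such a session looks like (table and column names are made-up placeholders; the HiveQL itself stays unchanged):

```
hive> set hive.execution.engine=spark;            -- per-session engine switch
hive> SELECT dept, avg(salary) FROM employees     -- unchanged HiveQL
    > GROUP BY dept;
```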
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Apache Mahout (expected in Mahout 1.0)
• Mahout news, April 25, 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3. Integration
Service categories and open source tools (tool logos omitted from this extraction):
• Storage / serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
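A hedged sketch of the newAPIHadoopRDD route from Python, modeled on the hbase_inputformat.py example bundled with Spark (requires a Spark runtime with the HBase client and Spark-examples converter classes on the classpath; the table name is a placeholder):

```python
# Sketch -- requires Spark plus HBase client and converter classes; not runnable standalone.
from pyspark import SparkContext

sc = SparkContext(appName="hbase-read-sketch")
conf = {"hbase.mapreduce.inputtable": "my_table"}  # placeholder table name
rows = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)
print(rows.count())
```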
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
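For reference, the two Spark 1.x-era YARN submission modes look roughly like this (class name, jar and sizing values are placeholders):

```
# Driver runs inside the YARN cluster:
spark-submit --master yarn-cluster --num-executors 4 --executor-memory 2g \
  --class com.example.MyApp my-app.jar
# Interactive shell: driver runs on the client, executors on YARN:
spark-shell --master yarn-client
```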
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
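A hedged sketch of the bullets above using the Spark 1.2-era HiveContext (requires a Spark build with Hive support; the table and column names are invented for illustration):

```python
# Sketch -- requires a Spark runtime built with Hive support; not runnable standalone.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-sketch")
hc = HiveContext(sc)
# Import relational data from a Hive table and run SQL over it
top_pages = hc.sql(
    "SELECT page, count(*) AS hits FROM weblogs "
    "GROUP BY page ORDER BY hits DESC LIMIT 10")
for row in top_pages.collect():
    print(row)
```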
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
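A hedged sketch of the receiver-based integration (the Python API for this arrives with Spark 1.3; the ZooKeeper address, consumer group and topic name are placeholders):

```python
# Sketch -- requires a Spark 1.3+ runtime and a running Kafka/ZooKeeper; not runnable standalone.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-sketch")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches
# {topic: number of receiver threads}
stream = KafkaUtils.createStream(ssc, "zookeeper:2181", "my-group", {"events": 1})
stream.map(lambda kv: kv[1]).count().pprint()  # messages per batch
ssc.start()
ssc.awaitTermination()
```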
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
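A hedged sketch of the schema-inference workflow with the Spark 1.2-era API (the file path and field names are placeholders; in 1.3 this moves to the DataFrame API):

```python
# Sketch -- requires a Spark runtime; people.json is a placeholder path.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-sketch")
sqlContext = SQLContext(sc)
people = sqlContext.jsonFile("people.json")  # schema inferred automatically, no DDL
people.printSchema()
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
print(adults.collect())
```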
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
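The three bullets can be sketched with the Spark 1.2-era SchemaRDD API (paths are placeholders; a Spark runtime is required):

```python
# Sketch -- requires a Spark runtime; paths are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-sketch")
sqlContext = SQLContext(sc)
events = sqlContext.jsonFile("events.json")      # any SchemaRDD source works
events.saveAsParquetFile("events.parquet")       # write out as Parquet
back = sqlContext.parquetFile("events.parquet")  # read back, schema preserved
back.registerTempTable("events")
print(sqlContext.sql("SELECT count(*) FROM events").collect())
```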
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
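The elasticsearch-hadoop native integration described above can be sketched like this (the index/type name and the document contents are illustrative placeholders; an existing SparkContext `sc` and a reachable Elasticsearch node are assumed):

```scala
import org.elasticsearch.spark._   // adds saveToEs / esRDD to SparkContext and RDDs

// Any RDD whose content can be translated into documents can be saved
val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark and Elasticsearch", "views" -> 10),
  Map("title" -> "RDDs as documents",       "views" -> 3)))
docs.saveToEs("blog/posts")        // index/type are placeholders

// Read an index back as an RDD of (documentId, document) pairs
val posts = sc.esRDD("blog/posts")
println(posts.count())
```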
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
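The storage-agnostic point above shows up directly in the API: the same `textFile` call works across schemes. All paths below are placeholders, and each scheme needs the matching connector jars and credentials configured on your cluster:

```scala
// Spark is file-system agnostic: only the URI scheme changes (2015-era schemes).
val local   = sc.textFile("file:///tmp/events.log")            // local file system
val hdfs    = sc.textFile("hdfs://namenode:8020/events")       // HDFS (optional)
val s3      = sc.textFile("s3n://my-bucket/events/")           // Amazon S3
val tachyon = sc.textFile("tachyon://master:19998/events")     // Tachyon
val swift   = sc.textFile("swift://container.provider/events") // OpenStack Swift
```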
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Component:
HDFS   | Tachyon
YARN   | Mesos
Tools:
Pig    | Spark native API
Hive   | Spark SQL
Mahout | MLlib
Storm  | Spark Streaming
Giraph | GraphX
HUE    | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
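In Spark 1.x, the Tachyon sharing described above is reachable directly from the RDD API via off-heap persistence (sketch; assumes `spark.tachyonStore.url` points at a running Tachyon master, and the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Persist RDD blocks off-heap in Tachyon: they live outside the executor
// JVMs, avoid GC pressure, and can survive executor restarts.
val errors = sc.textFile("hdfs:///logs").filter(_.contains("ERROR"))
errors.persist(StorageLevel.OFF_HEAP)
errors.count()   // first action materializes the blocks into Tachyon
```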
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
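The conciseness claim above is easiest to see on the canonical word count, which takes a page of Java MapReduce but only a few lines of the native Scala API (input/output paths are placeholders; an existing SparkContext `sc` is assumed):

```scala
// Word count with the Spark native Scala API
val counts = sc.textFile("hdfs:///input")
  .flatMap(_.split("\\s+"))      // split lines into words
  .map(word => (word, 1))        // pair each word with a count of 1
  .reduceByKey(_ + _)            // sum counts per word
counts.saveAsTextFile("hdfs:///output")
```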
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
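The "mix and match" point above can be sketched with schema inference over JSON (Spark 1.2-era SchemaRDD API; the file path is a placeholder, and an existing SQLContext `sqlContext` is assumed):

```scala
// Ingest semi-structured data: the schema is inferred from the JSON records
val people = sqlContext.jsonFile("people.json")
people.registerTempTable("people")

// Declarative SQL ...
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// ... mixed with imperative RDD operations on the result
teens.map(row => "Name: " + row(0)).collect().foreach(println)
```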
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
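The MLlib slide above is image-based; a minimal sketch of what MLlib usage looks like is k-means over a few hand-made 2-D points (all values are made up for illustration; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of made-up points
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// Train a k-means model and inspect the learned centers
val model = KMeans.train(points, 2, 10)   // k = 2, maxIterations = 10
model.clusterCenters.foreach(println)
```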
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
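The Spark Streaming slide above is image-based; its mini-batch model (compared with Storm on the next slide) can be sketched as a word count over 1-second batches from a socket source (host and port are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Mini batches: the stream is discretized into 1-second RDDs
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```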
95
Storm vs Spark Streaming
Criteria                    | Storm                             | Spark Streaming
Processing model            | Record at a time                  | Mini batches
Latency                     | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available                     | Core Spark API
Supported languages         | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
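The GraphX slide above is image-based; a minimal sketch of the API is PageRank over a tiny hand-made graph (vertex labels and edges are made up for illustration; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A three-vertex cycle: a -> b -> c -> a
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)
// Run PageRank until convergence within the given tolerance
graph.pageRank(0.001).vertices.collect().foreach(println)
```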
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
11
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A Shared Vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-real-time
• 4th generation: Batch, Interactive, Real-time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                             | Tez                            | Spark
License          | Open source, Apache 2.0, version 2.x  | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                               | Java                           | Scala
API              | [Java, Python, Scala], user-facing    | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries        | None, separate tools                  | None                           | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | MapReduce                           | Tez                              | Spark
Installation     | Bound to Hadoop                     | Bound to Hadoop                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode, except Hive/Pig | Difficult to program; no interactive mode, except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | To data types and data sources: same | To data types and data sources: same | To data types and data sources: same
YARN integration | YARN application                    | Ground-up YARN application       | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark, http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service | Open source tool
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without going through the Hadoop API: the Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra. http://tuplejump.github.io/calliope
• A Cassandra storage backend for Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its ability to read and write JSON text files.
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator - an implicit reference to Mesos as the original resource negotiator.
• Integration is still improving. Open SPARK issues mentioning YARN: https://issues.apache.org/jira (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
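The Hive round trip described above, sketched in PySpark (assuming Spark 1.2 built with Hive support, a hive-site.xml on the classpath, and a hypothetical `logs` table):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local[2]", "hive-demo")
sqlContext = HiveContext(sc)  # talks to the Hive metastore

# Import relational data from a Hive table and query it with SQL.
errors = sqlContext.sql("SELECT level, msg FROM logs WHERE level = 'ERROR'")

# The result is an ordinary RDD, so SQL and programmatic APIs mix freely.
print(errors.count())

# Write results back out as a new Hive table.
sqlContext.sql("CREATE TABLE error_logs AS SELECT * FROM logs WHERE level = 'ERROR'")
```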
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming - Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
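A minimal PySpark Streaming sketch of the native Kafka integration (the receiver-based approach; assumes Spark 1.3+, a ZooKeeper at the illustrative address below, and a hypothetical `events` topic):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "kafka-demo")
ssc = StreamingContext(sc, batchDuration=2)  # 2-second micro-batches

# Receiver-based stream: ZooKeeper quorum, consumer group, {topic: partitions}.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", {"events": 1})

# Each record is a (key, value) pair; count events per micro-batch.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```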
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at your JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
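The schema-inference flow above, sketched in PySpark (Spark 1.2-era API; `jsonFile` became `read.json` in later releases; the path is illustrative):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "json-demo")
sqlContext = SQLContext(sc)

# Infer the schema directly from the JSON files -- no DDL required.
people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()  # shows the inferred field names and types

# Query it like any other table.
people.registerTempTable("people")
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
```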
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
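The Parquet round trip, sketched in PySpark (Spark 1.2-era API; `saveAsParquetFile`/`parquetFile` were later folded into the DataFrame reader/writer; paths are illustrative):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[2]", "parquet-demo")
sqlContext = SQLContext(sc)

# Write a SchemaRDD out as Parquet; the schema travels with the data.
people = sqlContext.jsonFile("hdfs:///data/people.json")
people.saveAsParquetFile("hdfs:///data/people.parquet")

# Read it back and query; columnar storage means only 'name' is scanned.
parquet_people = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquet_people.registerTempTable("parquet_people")
names = sqlContext.sql("SELECT name FROM parquet_people")
```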
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos - references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffle implementation and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
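Because the cluster manager is pluggable, switching deployments is mostly a matter of the master URL. A PySpark sketch (host names are illustrative; YARN master syntax is Spark 1.x-era):

```python
from pyspark import SparkConf, SparkContext

# Pick one master URL; the application code stays the same.
conf = (SparkConf()
        .setAppName("deployment-demo")
        .setMaster("local[4]"))                  # local mode, 4 worker threads
# .setMaster("spark://master-host:7077")         # standalone cluster
# .setMaster("mesos://mesos-master:5050")        # Apache Mesos
# .setMaster("yarn-client")                      # YARN (Spark 1.x syntax)

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # quick sanity check
```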
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Component            Hadoop ecosystem   Spark ecosystem
Storage              HDFS               Tachyon
Resource management  YARN               Mesos
Tools                Pig                Spark native API
                     Hive               Spark SQL
                     Mahout             MLlib
                     Storm              Spark Streaming
                     Giraph             GraphX
                     HUE                Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
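To see why the native API reads so concisely, here is a tiny pure-Python toy that mimics the shape of the classic RDD word-count chain (not real Spark; `ToyRDD` and its methods are hypothetical stand-ins for the Spark operators of the same names):

```python
class ToyRDD:
    """A minimal, single-machine stand-in for a Spark RDD."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the resulting sequences.
        return ToyRDD(x for item in self.data for x in f(item))

    def mapPairs(self, f):
        # Spark's map(), used here to build (key, value) pairs.
        return ToyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Combine all values that share a key, like Spark's reduceByKey.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return list(self.data)

lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flatMap(lambda line: line.split())
               .mapPairs(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(sorted(counts.collect()))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The real Spark chain looks the same, minus the class definition, which is the point of the "nearly as simple as Scala" claim above.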
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
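A short PySpark sketch of that "mix and match" style (Spark 1.2-era API, where `inferSchema` builds a SchemaRDD; data is inline, so no external files are assumed):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[2]", "sparksql-demo")
sqlContext = SQLContext(sc)

# Build a SchemaRDD from an ordinary RDD of Rows.
rows = sc.parallelize([Row(name="Ann", age=34), Row(name="Bob", age=19)])
people = sqlContext.inferSchema(rows)
people.registerTempTable("people")

# Declarative SQL...
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
# ...mixed with imperative RDD operations on the result.
print(adults.map(lambda row: row.name.upper()).collect())
```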
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
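The "record at a time" vs "mini batches" distinction in the comparison above can be illustrated in a few lines of pure Python (a simulation of the two processing models, not real Storm or Spark code):

```python
def process_per_record(events, handle):
    """Storm-style: each record is handled the moment it arrives."""
    for event in events:
        handle([event])            # one invocation per record -> lowest latency

def process_micro_batches(events, handle, batch_size):
    """Spark Streaming-style: records are grouped into small batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)          # one invocation per batch -> higher throughput,
            batch = []             # latency bounded by the batch interval
    if batch:
        handle(batch)              # flush the final partial batch

calls = []
process_micro_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The trade-off in the table falls out directly: per-record handling minimizes latency, while batching amortizes per-invocation overhead and lets the batch engine (Core Spark) be reused.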
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
12
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs, Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: listen to what Spark developers are saying.
3 Vendors: beware <Hadoop vendor>-tinted goggles. FUD is still being 'offered' by some Hadoop vendors; claims need to be contextualized.
4 Analysts: thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name - an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list - stay tuned!
20
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
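To make the "assembly code" point concrete: even the trivial word count forces you to think in explicit map, shuffle and reduce phases. A pure-Python simulation of those three phases (illustrative only, not the Hadoop Java API):

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the grouped counts."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["to be or", "not to be"])))
print(counts)
```

Every abstraction listed above (Pig, Hive, Cascading, Crunch, …) exists to hide this phase plumbing behind a higher-level query or collection API.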
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st generation: Batch
• 2nd generation: Batch, Interactive
• 3rd generation: Batch, Interactive, Near-Real-time
• 4th generation: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink httpflinkapacheorg offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | MapReduce                                   | Tez                                  | Spark
License             | Open source, Apache 2.0, version 2.x        | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                        | Tez                                                              | Spark
Installation     | Bound to Hadoop                                                                  | Bound to Hadoop                                                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode (except Hive, Pig) | Difficult to program; no interactive mode (except Hive, Pig)     | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same support for data types and data sources                                     | Same support for data types and data sources                     | Same support for data types and data sources
YARN integration | YARN application                                                                 | Ground-up YARN application                                       | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
  1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
  2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
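In practice the switch is just a command-line flag. A sketch, assuming a Pig build with Spork support (the script name is illustrative):

```shell
# Run an existing Pig script on the Spark execution engine (script name is illustrative).
pig -x spark myscript.pig

# Compare with the classic engines:
pig -x mapreduce myscript.pig
pig -x tez myscript.pig
```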
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
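Putting that setting in the context of a Hive session (the table and query below are illustrative, not from the deck):

```sql
-- Switch the execution engine for the current Hive session (Hive on Spark beta).
SET hive.execution.engine=spark;

-- The same HiveQL then runs unchanged on Spark instead of MapReduce or Tez:
SELECT category, COUNT(*) AS n
FROM products
GROUP BY category;
```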
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• "Hive on Spark is blazing fast... or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services and their open source tools integrate with Spark across several layers: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: driving business insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop and Spark:
  • Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
58
3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data web applications for interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop

1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad project marries YARN and Apache Mesos resource management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: can't we all just get along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for data pipelines with native YARN integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "smart execution engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre file system: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...
76
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
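In practice, the deployment choice largely reduces to the --master URL passed to spark-submit. A hedged sketch (the cluster addresses and application file are illustrative):

```shell
# Same application, different cluster managers (addresses and paths are illustrative).
spark-submit --master local[4]              app.py   # local threads, no cluster
spark-submit --master spark://master:7077   app.py   # Spark standalone cluster
spark-submit --master mesos://master:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster          app.py   # Hadoop YARN (Spark 1.x syntax)
```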
78
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: ultra-fast data analysis with Spark and Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: httpwwwstratiocom
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its operational intelligence platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop

1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component | Hadoop ecosystem | Spark ecosystem
Storage   | HDFS             | Tachyon
Resources | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, with much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
---------|-------|----------------
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
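The first two rows of the table can be sketched in plain Python (not the Storm or Spark Streaming APIs): the same per-record function is applied either record-at-a-time or over buffered mini-batches.

```python
from typing import Iterable, Iterator, List

def record_at_a_time(stream: Iterable[int]) -> List[int]:
    """Storm-style: handle each record as soon as it arrives."""
    out = []
    for record in stream:
        out.append(record * 2)          # per-record processing
    return out

def mini_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Spark-Streaming-style: buffer records, then process each small batch."""
    batch: List[int] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield [r * 2 for r in batch]   # one batch job over the buffer
            batch = []
    if batch:
        yield [r * 2 for r in batch]

print(record_at_a_time([1, 2, 3, 4, 5]))        # [2, 4, 6, 8, 10]
print(list(mini_batches([1, 2, 3, 4, 5], 2)))   # [[2, 4], [6, 8], [10]]
```

Buffering is why Spark Streaming's latency is "a few seconds" while Storm's is sub-second, and also why each mini-batch can reuse the core Spark batch machinery.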
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup and even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
13
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
14
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: httpbigdataandreamostosiname Incomplete but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
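The gap these abstractions close can be seen in a plain-Python word count: the explicit map/shuffle/reduce phases of "assembly-code" MapReduce versus the one-expression style that Pig, Hive, Scalding and similar APIs give you. This is a sketch of the two programming styles, not any of those APIs:

```python
from collections import Counter, defaultdict
from itertools import chain

LINES = ["spark and hadoop", "spark or hadoop", "spark"]

# "Assembly-code" style: explicit map, shuffle and reduce phases.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

shuffled = defaultdict(list)
for word, one in chain.from_iterable(mapper(l) for l in LINES):
    shuffled[word].append(one)          # the shuffle: group values by key
mr_counts = dict(reducer(w, c) for w, c in shuffled.items())

# Higher-level style: the whole job collapses into one expression.
hl_counts = dict(Counter(chain.from_iterable(l.split() for l in LINES)))

print(mr_counts == hl_counts, mr_counts["spark"])  # True 3
```

Both compute the same result; the point is how much plumbing the low-level style forces you to write yourself.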
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
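The directed-acyclic-graph-of-tasks idea can be sketched with Python's stdlib graphlib: declare task dependencies, then execute in a dependency-respecting order. The task names are made up; real Tez DAGs are YARN applications, not local function calls:

```python
from graphlib import TopologicalSorter

# A toy task DAG: two extract tasks feed a join, which feeds a report.
dag = {
    "join":   {"extract_a", "extract_b"},
    "report": {"join"},
}

executed = []
def run(task):
    executed.append(task)   # stand-in for actually running the task

# static_order() yields tasks so that every dependency runs first.
for task in TopologicalSorter(dag).static_order():
    run(task)

print(executed)  # the two extracts (in either order), then join, then report
```

Expressing the whole pipeline as one DAG is what lets an engine like Tez (or Spark) avoid the intermediate materialization that chains of separate MapReduce jobs require.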
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
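The RDD idea can be sketched as a miniature, single-machine toy (not Spark's API): transformations are lazy and only recorded as lineage, and an action materializes the result.

```python
class ToyRDD:
    """A tiny sketch of Spark's RDD model: map/filter only record
    lineage; collect() (the action) replays it to produce data."""

    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []

    def map(self, f):
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, f):
        return ToyRDD(self._data, self._lineage + [("filter", f)])

    def collect(self):
        out = self._data
        for kind, f in self._lineage:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the lineage additionally enables fault tolerance: a lost partition is recomputed from its recorded transformations instead of being replicated up front.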
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
---------|------------------|-----|------
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
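The "reuse your mapper and reducer" point can be sketched in plain Python (not the Hadoop or Spark APIs; the function names are hypothetical): an existing mapper becomes a flatMap-style step and the reducer a reduceByKey-style step.

```python
from collections import defaultdict

# An existing Hadoop-style mapper and reducer (hypothetical user code).
def my_mapper(record):
    for token in record.split(","):
        yield token.strip(), 1

def my_reducer(key, values):
    return key, sum(values)

# Spark-style reuse: the mapper plugs into flatMap, the reducer into reduceByKey.
def flat_map(func, data):
    return [pair for rec in data for pair in func(rec)]

def reduce_by_key(func, pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(func(k, vs) for k, vs in groups.items())

data = ["a, b, a", "b, c"]
print(reduce_by_key(my_reducer, flat_map(my_mapper, data)))  # {'a': 2, 'b': 2, 'c': 1}
```

The business logic (my_mapper, my_reducer) is untouched; only the surrounding execution plumbing changes, which is what makes this kind of migration cheap.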
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (Status: Open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
[Diagram: Hadoop-ecosystem services and their open source tools integrated with Spark, grouped by category: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
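The schema-inference idea can be sketched in plain Python with the stdlib json module (the records are made up, and Spark SQL's real inference additionally merges nested structures and resolves conflicting types):

```python
import json

# Hypothetical JSON-lines records; fields vary from record to record.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

def infer_schema(lines):
    """Union the fields seen across all records, noting each field's type."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Once such a schema exists, the "just point at JSON files and query" workflow follows: the inferred fields become queryable columns without any DDL.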
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
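The "Data << RAM" advantage can be sketched by counting how often an expensive parse step runs with and without an in-memory cache (plain Python, not Spark's cache(); the data and parse function are made up):

```python
PARSE_CALLS = 0

def parse(raw):
    """Pretend-expensive parse step; counts how often it runs."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return [int(x) for x in raw.split(",")]

RAW = "1,2,3,4"

# Without caching: every computation re-reads and re-parses the input,
# the way a chain of independent MapReduce jobs would.
total = sum(parse(RAW))
peak = max(parse(RAW))
assert PARSE_CALLS == 2

# With caching: parse once, reuse the in-memory result for every
# later computation (the Spark-style win when data fits in RAM).
PARSE_CALLS = 0
cached = parse(RAW)
total, peak = sum(cached), max(cached)
print(PARSE_CALLS, total, peak)  # 1 10 4
```

The saving compounds with every extra pass over the same dataset, which is exactly the iterative and interactive workload Spark targets.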
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
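File-system agnosticism can be sketched as a job that only assumes an iterable of lines, so the storage backend can be swapped freely (plain Python, not Spark's storage API; the sources are stand-ins):

```python
import io

def word_count(lines):
    """Storage-agnostic job: works on any iterable of lines, wherever they live."""
    counts = {}
    for line in lines:
        for w in line.split():
            counts[w] = counts.get(w, 0) + 1
    return counts

# The same job runs unchanged over an in-memory source or a file-like
# object; in real Spark, over HDFS, S3, Cassandra, Tachyon, Swift, ...
in_memory = ["spark on s3", "spark on tachyon"]
file_like = io.StringIO("spark on s3\nspark on tachyon\n")

print(word_count(in_memory) == word_count(file_like))  # True
```

Because the computation never names its storage layer, "Bring Your Own Storage" is a configuration decision, not a code change.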
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
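In practice, switching between these deployment modes mostly comes down to the master URL handed to `spark-submit --master` or `SparkConf().setMaster(...)`. The helper below is a sketch of that convention; host names, ports, and mode labels follow Spark's documented URL schemes, and the defaults shown (7077, 5050) are Spark's usual ones.

```python
def master_url(mode, host=None, port=None, threads=None):
    """Build the master URL that selects a Spark deployment mode."""
    if mode == "local":
        # e.g. local[4] runs 4 worker threads in-process
        return "local[%d]" % threads if threads else "local"
    if mode == "standalone":
        return "spark://%s:%d" % (host, port or 7077)
    if mode == "mesos":
        return "mesos://%s:%d" % (host, port or 5050)
    if mode == "yarn":
        return "yarn-client"
    raise ValueError("unknown mode: %r" % mode)
```

For example, `master_url("standalone", host="master-host")` yields the URL you would pass to `spark-submit --master` for a standalone cluster.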
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform, with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
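To give a feel for how concise the native API is, here is a minimal sketch in PySpark style: a pure predicate (testable on its own) plugged into Spark's lazy transformation/action model. The log path is hypothetical, and `sc` is assumed to be a live SparkContext.

```python
def is_error(line):
    """Pure predicate -- easy to unit-test outside of Spark."""
    return "ERROR" in line

def error_count(sc, path="server.log"):
    # The same predicate plugged into the Spark API:
    lines = sc.textFile(path)              # RDD of strings (lazy)
    errors = lines.filter(is_error)        # lazily transformed
    return errors.count()                  # action triggers the job
```

In an interactive shell (Scala or Python), each of these lines can be typed and explored one at a time, which is a large part of the API's appeal.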
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
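The "mix and match" point can be sketched in a few lines of PySpark (Spark 1.2-era API): a JSON file becomes a table with inferred schema, then plain SQL runs over it. The file name and table name are hypothetical, and `sqlCtx` is assumed to be a live SQLContext.

```python
def adults_query(table, min_age=21):
    """Pure helper: the SQL text handed to Spark SQL."""
    return "SELECT name FROM %s WHERE age >= %d" % (table, min_age)

def adults(sqlCtx, path="people.json"):
    # jsonFile infers the schema automatically -- no DDL needed.
    people = sqlCtx.jsonFile(path)
    people.registerTempTable("people")
    return sqlCtx.sql(adults_query("people"))
```

The result of `adults(...)` is itself an RDD of rows, so it can flow straight into further imperative transformations.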
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: One record at a time | Mini-batches
• Latency: Sub-second | A few seconds
• Fault tolerance (every record processed): At least once (may have duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
14
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
15
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• "Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate." http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2. Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big-Data-related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on "Hadoop Isn't Just Hadoop Anymore", for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you with Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list, stay tuned!
20
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine".
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• License: Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
• Processing model: On-disk (disk-based parallelization), Batch | On-disk, Batch, Interactive | In-memory and on-disk, Batch, Interactive, Streaming (Near-Real-Time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], user-facing | Java, [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
• Compatibility: Compatibility to data types and data sources is the same for all three
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
• Performance (Spark): good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy, partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
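Point 1 above can be sketched in a few lines: the classic word-count mapper and reducer, written as ordinary functions, then re-wired into Spark's `flatMap`/`reduceByKey`. This is an illustrative sketch in Python; the path and function names are placeholders, and `sc` is assumed to be a live SparkContext.

```python
def mapper(line):
    """MapReduce-style map(): emit (word, 1) pairs for one input line."""
    return [(w, 1) for w in line.split()]

def reducer(a, b):
    """MapReduce-style reduce(): combine two counts for the same key."""
    return a + b

def word_count_on_spark(sc, path):
    # The same two functions, reused unchanged in a Spark pipeline:
    return sc.textFile(path).flatMap(mapper).reduceByKey(reducer)
```

The mapper and reducer stay testable on their own, which is exactly what makes this migration path low-risk.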
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer between any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, April 25, 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the stack (tool logos omitted in this text version):
• Storage / Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• A benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark into Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark opens many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
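In PySpark, the Hive integration is reached through a HiveContext. The sketch below assumes a Spark build with Hive support and a configured metastore; the table name `src` is the conventional Hive example table and is hypothetical here.

```python
def sample_query(table, limit=10):
    """Pure helper: the HiveQL text to run."""
    return "SELECT key, value FROM %s LIMIT %d" % (table, limit)

def query_hive_table(sc, table="src"):
    # HiveContext picks up the Hive metastore configuration; the result
    # is an ordinary Spark dataset that can feed MLlib or further SQL.
    from pyspark.sql import HiveContext
    hc = HiveContext(sc)
    return hc.sql(sample_query(table))
```

This is the sense in which Hive tables become just another Spark data source: once loaded, the rows flow through the same RDD APIs as any other data.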
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
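A minimal sketch of the native integration, assuming Spark's Kafka support is on the classpath and `ssc` is a live StreamingContext (the Python `KafkaUtils.createStream` API appeared around Spark 1.3); the ZooKeeper host, consumer group, and topic names are placeholders.

```python
def message_body(kv):
    """Kafka records arrive as (key, value) pairs; keep the value."""
    return kv[1]

def kafka_lines(ssc, zk_quorum="zk-host:2181", topic="events"):
    # Each mini-batch of the resulting DStream holds the message bodies
    # received from the Kafka topic during that batch interval.
    from pyspark.streaming.kafka import KafkaUtils
    stream = KafkaUtils.createStream(ssc, zk_quorum, "demo-group", {topic: 1})
    return stream.map(message_body)
```

From here the DStream supports the usual transformations (filter, window, reduceByKey, …) before the results are written out.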
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
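To make the "no more DDL" point concrete, the sketch below includes a tiny pure-Python model of what schema inference does (take the union of field names across records), next to the Spark SQL calls that do the real thing. `sc`/`sqlCtx` are assumed live; the table name and query are illustrative.

```python
import json

def infer_keys(json_lines):
    """Tiny model of schema inference: union of field names across records."""
    keys = set()
    for line in json_lines:
        keys |= set(json.loads(line))
    return sorted(keys)

def json_table(sqlCtx, sc, json_lines):
    # With a live SparkContext/SQLContext, the same records become a
    # queryable table with no DDL at all (Spark 1.2-era jsonRDD API):
    people = sqlCtx.jsonRDD(sc.parallelize(json_lines))
    people.registerTempTable("people")
    return sqlCtx.sql("SELECT name FROM people WHERE age > 30")
```

Records with different fields simply widen the inferred schema, which is what makes this convenient for semi-structured data.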
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or … HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
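The Data << RAM point is easiest to see in code. The sketch below is plain Python, with no Spark involved; `CachedDataset` is a made-up class that mimics what `rdd.cache()` buys an iterative job: the expensive parse happens once, and every later pass over the data is served from memory.

```python
import json

class CachedDataset:
    """Toy analogue of Spark's rdd.cache(): parse the raw input once,
    keep the parsed records in memory, and reuse them across passes."""
    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
        self._parsed = None      # filled on first use
        self.parse_calls = 0     # counts how often we pay the parse cost

    def records(self):
        if self._parsed is None:             # first pass: parse and cache
            self.parse_calls += 1
            self._parsed = [json.loads(line) for line in self.raw_lines]
        return self._parsed                  # later passes: served from RAM

lines = ['{"user": "a", "clicks": 3}', '{"user": "b", "clicks": 5}']
ds = CachedDataset(lines)
total = sum(r["clicks"] for r in ds.records())   # pass 1: parses the data
users = {r["user"] for r in ds.records()}        # pass 2: cache hit, no parse
```

When the data no longer fits in memory, this advantage fades, which is exactly the Data >> RAM caveat above.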
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: it dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
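Item 2 above describes a pipeline whose input and output are both message topics, with no file system in between. A minimal plain-Python sketch of that shape, with `deque`s standing in for Kafka topics (the topic names and the alert format are invented for illustration, not a Kafka API):

```python
import json
from collections import deque

# In-memory stand-ins for a Kafka input topic and output topic
input_topic = deque(['{"sensor": "t1", "temp": 21}',
                     '{"sensor": "t1", "temp": 25}'])
output_topic = deque()

while input_topic:                        # consume: read events off the topic
    event = json.loads(input_topic.popleft())
    if event["temp"] > 22:                # transform: keep only hot readings
        output_topic.append(json.dumps({"alert": event["sensor"]}))

# No HDFS or any other file system was touched:
# data flowed topic -> job -> topic.
```

A real deployment would swap the deques for Kafka consumers and producers; the point is that persistence is optional when the downstream system is another stream.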
75
1 File System
Being file-system agnostic, and coupled with its analytics capabilities, Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives
Components:
Hadoop ecosystem | Spark ecosystem
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose lambda expressions make code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
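The conciseness point is about chaining short lambda-based transformations. A plain-Python sketch of that style, with the builtins `filter` and `map` standing in for the RDD operations of the same names (no Spark required):

```python
# Lambda-style chaining, as the Spark APIs encourage.
# rdd.filter(...).map(...) in Spark has the same shape as:
nums = range(1, 11)

result = list(map(lambda x: x * x,                       # ~ rdd.map(...)
                  filter(lambda x: x % 2 == 0, nums)))   # ~ rdd.filter(...)
# result holds the squares of the even numbers in 1..10
```

In Spark itself the same two lambdas would be passed to `rdd.filter` and `rdd.map`; Java 8 lambdas give the Java API this same one-line-per-transformation feel.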
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
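The "mix and match SQL and imperative APIs" idea can be shown without Spark: the sketch below uses Python's stdlib `sqlite3` as a stand-in for Spark SQL (the `ads` table and its columns are invented for the example). Schema-carrying JSON is ingested, queried with SQL, and the result is post-processed with ordinary code.

```python
import json
import sqlite3

# Ingest schema-carrying data (JSON records, as Spark SQL can do natively)
raw = ['{"name": "ad-1", "clicks": 10}', '{"name": "ad-2", "clicks": 40}']
rows = [(r["name"], r["clicks"]) for r in map(json.loads, raw)]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ads (name TEXT, clicks INT)")
db.executemany("INSERT INTO ads VALUES (?, ?)", rows)

# Declarative SQL step ...
top = db.execute("SELECT name, clicks FROM ads WHERE clicks > 20").fetchall()

# ... mixed with an imperative step on the result
labels = [f"{name}: {clicks} clicks" for name, clicks in top]
```

In Spark SQL the same flow would be `jsonRDD` / `read.json`, a `SELECT`, and then ordinary RDD or DataFrame code over the result.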
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
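The first two table rows come down to one structural difference, sketched here in plain Python (no Storm or Spark involved; the batch size of 4 is arbitrary): record-at-a-time systems handle each event the moment it arrives, while Spark Streaming first groups events into fixed mini-batches and runs a small batch job per group, which is what adds the few seconds of latency.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Storm-style: each record is handled the moment it arrives
record_at_a_time = [e * 10 for e in events]

# Spark Streaming-style: records are grouped into fixed mini-batches
# first, then each batch is processed as one small batch job
def mini_batches(stream, size):
    for i in range(0, len(stream), size):
        yield stream[i:i + size]

batched = [sum(batch) for batch in mini_batches(events, 4)]  # one result per batch
```

The mini-batch model is also what lets Spark Streaming reuse the core Spark API and its exactly-once batch semantics.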
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
15
II Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset (http://bigdata.andreamostosi.name): an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list, stay tuned!
20
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• 1st generation, MapReduce: Batch
• 2nd generation, Tez: Batch, Interactive
• 3rd generation, Spark: Batch, Interactive, Near-Real-Time
• 4th generation, Flink: Batch, Interactive, Real-Time, Iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
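What makes an RDD "resilient" is that it remembers its lineage (the chain of transformations), not just the data, so a lost partition can be recomputed rather than restored from disk. A toy sketch of that idea in plain Python (`ToyRDD` is a made-up class, far simpler than Spark's real RDD):

```python
class ToyRDD:
    """Toy of the RDD idea: store the base data plus the lineage of
    transformations, so results can always be recomputed on demand."""
    def __init__(self, source, transforms=()):
        self.source = source            # base data
        self.transforms = transforms    # lineage: functions to re-apply

    def map(self, f):
        # Transformations are lazy: they only extend the lineage
        return ToyRDD(self.source, self.transforms + (f,))

    def compute(self):
        # An action: rebuild the result from the lineage
        data = list(self.source)
        for f in self.transforms:
            data = [f(x) for x in data]
        return data

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
result = rdd.compute()   # can be recomputed any time from the lineage
```

Real RDDs add partitioning, caching, and fault recovery per partition, but the lazy-lineage shape is the same.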
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
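The co-occurrence recommendation technique referenced above has a small core, sketched here in plain Python (no Mahout or Spark; the shopping histories are invented sample data): count how often item pairs appear together in user histories, then recommend items with high co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Invented sample data: one purchase history per user
histories = [["tv", "hdmi-cable"],
             ["tv", "hdmi-cable", "soundbar"],
             ["tv", "soundbar"]]

cooc = Counter()
for items in histories:
    # count each unordered pair of distinct items seen together
    for a, b in combinations(sorted(set(items)), 2):
        cooc[(a, b)] += 1

# Items seen together with "tv", strongest signal first
with_tv = sorted(((n, pair) for pair, n in cooc.items() if "tv" in pair),
                 reverse=True)
```

Mahout's Spark version distributes exactly this pair-counting over RDDs and then filters the raw counts with a significance test (LLR) before recommending.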
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Spark integrates with open source tools across the Hadoop ecosystem (tool logos omitted from this slide), by service category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: Spark-HBase Connector, https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
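A short sketch of the Hive integration described above, using Spark's HiveContext (`sc` is assumed to be an existing SparkContext, and the table names are hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; table names are hypothetical
val hiveCtx = new HiveContext(sc)

// Query an existing Hive table with plain HiveQL
val top = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
top.collect().foreach(println)

// Write query results back out to a Hive table
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")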
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka; Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
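The native Kafka integration can be sketched as follows (the ZooKeeper quorum, consumer group, and topic name are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical ZooKeeper quorum, group id, and topic name
val ssc = new StreamingContext(sc, Seconds(10))

// Receiver-based stream of (key, message) pairs from the "events" topic
val messages = KafkaUtils
  .createStream(ssc, "zkhost:2181", "demo-group", Map("events" -> 1))
  .map(_._2)

messages.count().print()   // number of events per 10-second micro-batch

ssc.start()
ssc.awaitTermination()
```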
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume; there are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
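The schema-inference workflow described above looks roughly like this (the input file name is hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records -- no DDL needed
val people = sqlContext.jsonFile("people.json")   // hypothetical input file
people.printSchema()

// Register and query directly with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect()
  .foreach(println)
```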
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
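The three bullets above can be sketched in a few lines (paths are hypothetical; `sqlContext` is assumed to be an existing SQLContext):

```scala
// Import relational data from a Parquet file (hypothetical path)
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")

// Run SQL queries over the imported data
events.registerTempTable("events")
val daily = sqlContext.sql("SELECT COUNT(*) FROM events")

// Write any SchemaRDD back out as Parquet
daily.saveAsParquetFile("hdfs:///data/daily_counts.parquet")
```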
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL; this library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
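The elasticsearch-hadoop RDD integration mentioned above can be sketched as follows (the node address and index/type names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds esRDD / saveToEs extensions

// Hypothetical Elasticsearch node and index/type names
val conf = new SparkConf()
  .setAppName("EsSketch")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Read an Elasticsearch index as an RDD of (id, document) pairs
val docs = sc.esRDD("logs/events")
println(docs.count())

// Save any RDD whose elements translate into documents
sc.makeRDD(Seq(Map("level" -> "WARN", "msg" -> "disk almost full")))
  .saveToEs("logs/alerts")
```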
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
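This cluster-manager agnosticism shows up directly in application code: only the master URL changes between deployments. A minimal sketch (host names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The application code itself does not change across deployments --
// only the master URL selects the cluster manager (hosts are hypothetical)
val conf = new SparkConf()
  .setAppName("DeploymentAgnosticApp")
  .setMaster("local[4]")              // local mode, 4 threads
// .setMaster("spark://host:7077")    // standalone cluster
// .setMaster("mesos://host:5050")    // Apache Mesos
// (on YARN, the master is usually supplied via spark-submit --master yarn-cluster)
val sc = new SparkContext(conf)
```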
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component → Spark ecosystem alternative
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria          | YARN                                     | Mesos
Resource sharing  | Yes                                      | Yes
Written in        | Java                                     | C++
Scheduling        | Memory only                              | CPU and Memory
Running tasks     | Unix processes                           | Linux Container groups
Requests          | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity          | Less mature                              | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
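The conciseness of the native API is easiest to see on the canonical word count (paths are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
// Word count in the native Scala API -- the whole job is a few lines
val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/word_counts")
```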
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
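The "mix and match" point above can be sketched in Spark 1.2-era code (the input file and case class are hypothetical; `sc`/`sqlContext` are assumed to exist):

```scala
// Hypothetical input: CSV lines of "name,age"
case class Person(name: String, age: Int)

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion
people.registerTempTable("people")

// Declarative SQL ...
val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age BETWEEN 13 AND 19")

// ... mixed with the imperative RDD API
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
```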
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
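As a flavor of the MLlib API (the input file of space-separated numeric features is hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input file with space-separated numeric features
val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the data into 2 groups, with up to 20 iterations
val model = KMeans.train(data, 2, 20)
model.clusterCenters.foreach(println)
```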
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
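A minimal Spark Streaming sketch, the classic network word count (the host/port of the text source are hypothetical; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical host/port of a text source (e.g. `nc -lk 9999`)
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Word count over 1-second micro-batches
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```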
95
Storm vs Spark Streaming
Criteria                                  | Storm                             | Spark Streaming
Processing model                          | Record at a time                  | Mini batches
Latency                                   | Sub-second                        | Few seconds
Fault tolerance (every record processed)  | At least once (may be duplicates) | Exactly once
Batch framework integration               | Not available                     | Core Spark API
Supported languages                       | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
16
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand" – Amir H. Payberah, Swedish Institute of Computer Science (SICS)
17
2 Typical Big Data Stack
18
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): batch
• Tez (2nd generation): batch, interactive
• Spark (3rd generation): batch, interactive, near-real-time
• Flink (4th generation): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets": http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs); …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing": https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | Hadoop MapReduce                          | Tez                              | Spark
License             | Open Source, Apache 2.0, version 2.x      | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive    | In-memory, on-disk; batch, interactive, streaming (near-real-time)
Language written in | Java                                      | Java                             | Scala
API                 | [Java, Python, Scala], user-facing        | Java, [ISV/engine/tool builder]  | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                      | None                             | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark (continued)
Criteria         | Hadoop MapReduce                          | Tez                              | Spark
Installation     | Bound to Hadoop                           | Bound to Hadoop                  | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration | YARN application                          | Ground-up YARN application       | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark (continued)
Criteria    | Hadoop MapReduce        | Tez                     | Spark
Deployment  | YARN                    | YARN                    | [Standalone, YARN, SIMR, Mesos, …]
Performance | –                       | –                       | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
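Point 1 above can be sketched as follows: existing per-record logic from a MapReduce job becomes plain functions that a Spark pipeline calls (the record format and input path here are hypothetical):

```scala
// Hypothetical per-record logic lifted out of an old MapReduce job:
// the former Mapper body becomes a plain function ...
def parseRecord(line: String): (String, Int) = {
  val fields = line.split(",")
  (fields(0), fields(1).toInt)
}
// ... and the former Reducer body becomes an associative combine function
def combine(a: Int, b: Int): Int = a + b

// The Spark pipeline then just calls them (input path is hypothetical)
val totals = sc.textFile("hdfs:///data/input")
  .map(parseRecord)
  .reduceByKey(combine)
```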
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
[Diagram: Hadoop ecosystem services and their open source tools integrated with Spark — storage/serving layer, data formats, data ingestion services, resource management, search, SQL]
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations for reading from and writing to HBase without needing the Hadoop API at all, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as 'the' resource negotiator)
• Integration is still improving; see the open 'YARN' issues in the Spark JIRA (project = SPARK AND summary ~ yarn AND status = OPEN)
• Some of these issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
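Schema inference here means scanning the data and taking the union of fields seen across records, instead of declaring a schema up front with DDL. A toy sketch of the idea in plain Python (not the Spark implementation):

```python
import json

def infer_schema(records):
    """Merge key -> type across JSON objects, loosely mimicking how an
    engine can derive a schema by scanning the data instead of using DDL."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, type(value).__name__)
    return schema

lines = ['{"name": "Alice", "age": 34}',
         '{"name": "Bob", "city": "LA"}']
schema = infer_schema(json.loads(line) for line in lines)
# Union of all fields seen across both records: name, age, city
```

A real engine also reconciles conflicting types across records and handles nesting; this sketch just keeps the first type it sees per field.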
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: 'CrunchIndexerTool on Spark'
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more 'stream oriented', has a more mature shuffling implementation, and has closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'smart execution engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the Big Data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the 'Right' Execution Engine, Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, Erik Halseth (Datameer), January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recordit.blog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
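Whatever the cluster manager, the application code stays the same; only the master URL passed to spark-submit changes. A hedged illustration (host names, ports and app.jar are placeholders; in Spark 1.x the YARN modes are spelled yarn-client and yarn-cluster):

```
spark-submit --master local[4]            app.jar   # local mode, 4 worker threads
spark-submit --master spark://host:7077   app.jar   # standalone cluster
spark-submit --master mesos://host:5050   app.jar   # Apache Mesos
spark-submit --master yarn-client         app.jar   # Hadoop YARN
```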
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, Eric Carr, September 25, 2014: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop ecosystem | Spark ecosystem
-----------------|------------------|------------------------
Storage          | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center 'OS':
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
-----------------|-------------------------------------------|------------------------
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise lambda expressions get code nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
-----------------------------------------|-----------------------------------|---------------------
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
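The 'mini batches' model in the table above trades latency for throughput and exactly-once semantics by grouping a record stream into small batches that are then processed as ordinary batch jobs. A toy sketch of the idea in plain Python (not the Spark implementation):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size batches, the way a
    micro-batch engine turns streaming into repeated small batch jobs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # hand a full batch to the batch engine
            batch = []
    if batch:                # flush the final partial batch
        yield batch

batches = list(micro_batches(range(7), 3))
# [[0, 1, 2], [3, 4, 5], [6]]
```

In a real engine the trigger is a time interval (the batch duration) rather than a record count, which is where the 'few seconds' latency in the table comes from.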
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
17
2 Typical Big Data Stack
18
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore', for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
19
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you with Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (machine learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned
20
5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
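The 'assembly code' complaint is about boilerplate: the classic Java WordCount needs mapper and reducer classes plus job wiring. The same three phases can be sketched in a few lines of plain Python (illustration only, no Hadoop involved):

```python
from collections import defaultdict

lines = ["to be or not to be", "to spark"]

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the values for each key
counts = {key: sum(values) for key, values in groups.items()}
# counts["to"] == 3, counts["be"] == 2
```

The higher-level APIs on this slide (Pig, Hive, Scalding, Crunch, …) exist precisely to let you write at this level of brevity while still compiling down to MapReduce jobs.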
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink:
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: 'A YARN-based system for parallel processing of large data sets': http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• Rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the core capability of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark
• License: Apache 2.0 open source for all three (MapReduce version 2.x, Tez version 0.x, Spark version 1.x)
• Processing model: on-disk (disk-based parallelization), batch (MapReduce); on-disk, batch and interactive (Tez); in-memory and on-disk, batch, interactive and streaming/near-real-time (Spark)
• Language written in: Java (MapReduce); Java (Tez); Scala (Spark)
• API: [Java, Python, Scala], user-facing (MapReduce); Java, for ISV/engine/tool builders (Tez); [Scala, Java, Python], user-facing (Spark)
• Libraries: none, separate tools (MapReduce); none (Tez); [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX] (Spark)
29
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Installation: bound to Hadoop (MapReduce); bound to Hadoop (Tez); not bound to Hadoop (Spark)
• Ease of use: difficult to program, needs abstractions, no interactive mode except via Hive/Pig (MapReduce); difficult to program, no interactive mode except via Hive/Pig (Tez); easy to program, no need for abstractions, interactive mode (Spark)
• Compatibility: compatibility with data types and data sources is the same for all three
• YARN integration: YARN application (MapReduce); ground-up YARN application (Tez); Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark (continued)
• Deployment: YARN (MapReduce); YARN (Tez); [standalone, YARN, SIMR, Mesos, …] (Spark)
• Performance: Spark offers good performance when data fits into memory, with performance degradation otherwise
• Security: more features and projects (MapReduce); more features and projects (Tez); still in its infancy, partial support (Spark)
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
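As a sketch of point 1, per-record logic written for an existing MapReduce job can often be called directly from Spark transformations. All names below (the legacy functions, the log format, the paths) are hypothetical, just to show the shape of such a migration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical logic lifted out of an existing MapReduce job:
// the mapper emitted (statusCode, 1) pairs from log lines,
// the reducer summed the counts per status code.
object Legacy {
  def mapLogic(line: String): Seq[(String, Int)] =
    line.split("\\s+").lastOption.map(status => (status, 1)).toSeq
  def reduceLogic(a: Int, b: Int): Int = a + b
}

object MigratedJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MigratedJob"))
    sc.textFile("hdfs:///logs/access.log")   // placeholder path
      .flatMap(Legacy.mapLogic)              // reuse the mapper's per-record logic
      .reduceByKey(Legacy.reduceLogic)       // reuse the reducer's combine logic
      .saveAsTextFile("hdfs:///logs/status-counts")
    sc.stop()
  }
}
```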
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading on Spark (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout on Spark (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout on Spark (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services of the Hadoop ecosystem and the open source tools integrating with Spark, by category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
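A minimal sketch of the newAPIHadoopRDD approach described above, along the lines of HBaseTest.scala; the table name is a placeholder and the HBase configuration is assumed to be on the classpath.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseSketch"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

    // Each record is a (row key, Result) pair produced by HBase's InputFormat
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Rows in table: ${hbaseRDD.count()}")
    sc.stop()
  }
}
```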
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
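A sketch of the Spark Cassandra Connector API described above; the contact point, keyspace, table and column names are all hypothetical.

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val sc = new SparkContext(conf)

    // Expose a Cassandra table as an RDD and filter it with ordinary Spark code
    val users = sc.cassandraTable("my_keyspace", "users")  // hypothetical table
    val active = users.filter(row => row.getBoolean("active"))
    println(s"Active users: ${active.count()}")

    // Write a plain RDD back to another (hypothetical) table
    sc.parallelize(Seq(("logins", 1)))
      .saveToCassandra("my_keyspace", "counters", SomeColumns("name", "value"))
    sc.stop()
  }
}
```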
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: an introduction to Spark with Cassandra (part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
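The Hive support above can be sketched with the Spark 1.2-era HiveContext API; the table and column names are hypothetical, and a Hive metastore is assumed to be configured.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveContext = new HiveContext(sc)

    // Run HiveQL against tables registered in the Hive metastore
    val recent = hiveContext.sql(
      "SELECT user_id, amount FROM sales WHERE year = 2015") // hypothetical table
    recent.collect().foreach(println)
    sc.stop()
  }
}
```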
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: integrating Kafka and Spark Streaming: code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
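A minimal sketch of the receiver-based Kafka integration from the guide above, in the Spark 1.x API; the ZooKeeper address, consumer group and topic name are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second micro-batches

    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> #threads)
    val messages = KafkaUtils.createStream(ssc,
      "zookeeper:2181", "my-group", Map("my-topic" -> 1))

    // Count words per batch, just to show a transformation on the stream
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1L))
      .reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```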
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume integration guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
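The schema-inference workflow above can be sketched in the Spark 1.2-era API; the file path and field names are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonSketch"))
    val sqlContext = new SQLContext(sc)

    // The schema is inferred from the JSON records; no DDL needed
    val people = sqlContext.jsonFile("hdfs:///data/people.json") // placeholder path
    people.printSchema()

    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)
    sc.stop()
  }
}
```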
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
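The Parquet round trip listed above can be sketched as follows, again in the 1.2-era API; the paths and the column name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetSketch"))
    val sqlContext = new SQLContext(sc)

    // Read a Parquet file as a SchemaRDD, query it, and write the result back out
    val events = sqlContext.parquetFile("hdfs:///data/events.parquet") // placeholder
    events.registerTempTable("events")
    sqlContext.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
      .saveAsParquetFile("hdfs:///data/event_counts.parquet")          // placeholder
    sc.stop()
  }
}
```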
58
3. Integration
• The Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
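A sketch of the elasticsearch-hadoop RDD integration mentioned above; the cluster address and the index/type name are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs and esRDD to SparkContext

object EsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsSketch")
      .set("es.nodes", "localhost:9200") // placeholder cluster address
    val sc = new SparkContext(conf)

    // Any RDD whose elements translate to documents can be indexed
    val docs = sc.makeRDD(Seq(
      Map("title" -> "Spark with Hadoop", "views" -> 10),
      Map("title" -> "Spark without Hadoop", "views" -> 7)))
    docs.saveToEs("talks/slides") // hypothetical index/type

    // Reading back yields an RDD of (document id, document) pairs
    println(sc.esRDD("talks/slides").count())
    sc.stop()
  }
}
```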
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate the ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The future architecture of a data lake: an in-memory data exchange platform using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The challenge of choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
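The S3 option in point 4 can be sketched in a few lines: reading from S3 looks exactly like reading from HDFS, only the URI scheme changes. The bucket name and credentials below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3Sketch"))
    // Credentials would normally come from the environment or core-site.xml
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")     // placeholder
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY") // placeholder

    // Same textFile API as HDFS; only the scheme differs
    val logs = sc.textFile("s3n://my-bucket/logs/*.log") // hypothetical bucket
    println(s"Error lines: ${logs.filter(_.contains("ERROR")).count()}")
    sc.stop()
  }
}
```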
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
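In code, the cluster-manager choice surfaces only as the master URL; the same application can target several of the deployments above. The host names below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploySketch {
  def main(args: Array[String]): Unit = {
    // Pick one master URL; the application code itself does not change.
    val conf = new SparkConf().setAppName("DeploySketch")
      .setMaster("local[4]")                    // local mode, 4 threads
      // .setMaster("spark://master-host:7077") // standalone cluster (placeholder host)
      // .setMaster("mesos://mesos-host:5050")  // Mesos (placeholder host)
      // .setMaster("yarn-client")              // YARN (Spark 1.x syntax)
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())     // tiny sanity job
    sc.stop()
  }
}
```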
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: ultra-fast data analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: infrastructure, analytics and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component and its Spark ecosystem alternative:
• HDFS: Tachyon
• YARN: Mesos
• Pig: Spark native API
• Hive: Spark SQL
• Mahout: MLlib
• Storm: Spark Streaming
• Giraph: GraphX
• HUE: Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Resource sharing: yes (YARN); yes (Mesos)
• Written in: Java (YARN); C++ (Mesos)
• Scheduling: memory only (YARN); CPU and memory (Mesos)
• Running tasks: Unix processes (YARN); Linux container groups (Mesos)
• Requests: specific requests and locality preference (YARN); more generic, but more coding for writing frameworks (Mesos)
• Maturity: less mature (YARN); relatively more mature (Mesos)
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
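The conciseness of this chained, functional style can be sketched in plain stdlib Python. This is an illustrative stand-in, not PySpark itself; no Spark installation is assumed, and the sample lines are invented:

```python
from collections import Counter
from functools import reduce

# A word count in the chained, functional style that Spark's native
# API (Scala, Python, Java 8 lambdas) exposes. Pure-Python sketch.
lines = ["to be or not to be", "to spark or not to spark"]

# "flatMap" step: split every line into words
words = [w for line in lines for w in line.split()]

# "map" + "reduceByKey" steps: fold each word into a running count
counts = reduce(
    lambda acc, w: (acc.update([w]) or acc),  # Counter.update returns None, so `or acc` keeps the accumulator
    words,
    Counter(),
)

print(counts["to"])     # 4
print(counts["spark"])  # 2
```

In actual Spark code the same shape would be `sc.textFile(...).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.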
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
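The "mix and match SQL and imperative APIs" idea can be illustrated with Python's stdlib `sqlite3` standing in for the SQL engine; the table and column names below are invented for the illustration, not Spark SQL API:

```python
import sqlite3

# Declarative + imperative mix, the pattern Spark SQL enables.
# sqlite3 is only a stand-in engine here; schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# SQL part: aggregate declaratively
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: arbitrary post-processing in the host language
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 8), ('bob', 7)]
print(top)   # ('ann', 8)
```

In Spark SQL the result of a query is an RDD (later a DataFrame), so the post-processing step can itself be distributed.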
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria                  Storm                       Spark Streaming
Processing model          Record at a time            Mini batches
Latency                   Sub-second                  Few seconds
Fault tolerance (every    At least once (may be       Exactly once
record processed)         duplicates)
Batch framework           Not available               Core Spark API
integration
Supported languages       Any programming language    Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
httpwwwslidesharenetsbaltagi
18
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset httpbigdataandreamostosiname An incomplete but useful list of Big Data related projects, packed into a JSON dataset
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on "Hadoop Isn't Just Hadoop Anymore" for a picture representing the evolution of Apache Hadoop httpswwwyoutubecomwatchv=1KvTZZAkHy0
19
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.

• MapReduce (1st generation): batch
• Tez (2nd generation): batch, interactive
• Spark (3rd generation): batch, interactive, near-real-time
• Flink (4th generation): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch! Scalability, abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing" httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                        Spark
License           Open Source Apache 2.0,   Open Source Apache 2.0,    Open Source Apache 2.0,
                  version 2.x               version 0.x                version 1.x
Processing model  On-disk (disk-based       On-disk; batch,            In-memory and on-disk; batch,
                  parallelization); batch   interactive                interactive, streaming
                                                                       (near real-time)
Written in        Java                      Java                       Scala
API               [Java, Python, Scala],    Java, [ISV/Engine/Tool     [Scala, Java, Python],
                  user-facing               builder]                   user-facing
Libraries         None, separate tools      None                       [Spark Core, Spark Streaming,
                                                                       Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                 Tez                        Spark
Installation      Bound to Hadoop           Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;      Easy to program, no need of
                  needs abstractions;       no interactive mode        abstractions; interactive
                  no interactive mode       except Hive, Pig           mode
                  except Hive, Pig
Compatibility     Same for all three with respect to data types and data sources
YARN integration  YARN application          Ground-up YARN             Spark is moving towards YARN
                                            application
30
Hadoop MapReduce vs Tez vs Spark

Criteria      MapReduce                Tez                      Spark
Deployment    YARN                     YARN                     [Standalone, YARN, SIMR, Mesos, ...]
Performance   -                        -                        Good performance when data fits into
                                                                memory; performance degradation
                                                                otherwise
Security      More features and        More features and        Still in its infancy; partial support
              projects                 projects
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark" httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
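Point 1 above - reusing the same mapper and reducer functions - can be sketched in plain Python: one function pair drives both a MapReduce-style pipeline (explicit shuffle) and a Spark-style chained pipeline. This mimics the shape of the migration; in real PySpark the same functions would be passed to `rdd.flatMap` and `rdd.reduceByKey`:

```python
from collections import defaultdict
from itertools import groupby

# The same mapper/reducer pair, reused by two pipeline styles.
def mapper(line):
    # emit (word, 1) pairs, as a classic WordCount mapper does
    return [(w, 1) for w in line.split()]

def reducer(key, values):
    return (key, sum(values))

lines = ["spark and hadoop", "spark or hadoop"]

# MapReduce-style: explicit map -> shuffle (sort + group) -> reduce
pairs = sorted(kv for line in lines for kv in mapper(line))
mr_result = dict(reducer(k, [v for _, v in grp])
                 for k, grp in groupby(pairs, key=lambda kv: kv[0]))

# Spark-style: the very same mapper, folded in a chained fashion
# (in PySpark: sc.parallelize(lines).flatMap(mapper)
#              .reduceByKey(lambda a, b: a + b))
acc = defaultdict(int)
for line in lines:
    for k, v in mapper(line):
        acc[k] += v
spark_result = dict(acc)

print(mr_result)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```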
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Help existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (status: Open; Q1 2015) httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
bull Sqoop ( aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark; programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Services and their open source tools integrating with Spark (tool logos shown in the original slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector
• MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL! Just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
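What "automatically infer the schema" means can be sketched in a few lines of stdlib Python: scan the JSON records and union their fields and value types, which is roughly what Spark SQL does at scale (plus type widening and nested structures). The record contents below are invented for the illustration:

```python
import json

# Minimal sketch of JSON schema inference: union the fields seen
# across records and note each field's value type. Spark SQL performs
# this kind of pass when loading a JSON dataset, so no DDL is needed.
records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for rec in records:
    for field, value in json.loads(rec).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print(sorted(schema))  # ['age', 'city', 'name']
print(schema["age"])   # {'int'}
```

Note how the inferred schema is the union of fields: records need not all carry the same keys, which is exactly why schema-on-read beats up-front DDL for semi-structured data.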
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
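The Data << RAM point - parse once, cache in memory, reuse many times - is the essence of Spark's `rdd.cache()`. A stdlib Python sketch of the effect (the `parse` function and its input are invented for the illustration):

```python
from functools import lru_cache

parse_calls = 0

@lru_cache(maxsize=None)  # stands in for rdd.cache(): keep results in memory
def parse(line):
    # the expensive parsing work runs only once per distinct input
    global parse_calls
    parse_calls += 1
    return len(line.split(","))

data = "a,b,c"
# an iterative job (e.g. machine learning) touches the same data many times
results = [parse(data) for _ in range(5)]

print(results)      # [3, 3, 3, 3, 3]
print(parse_calls)  # 1 -- parsed once, served from memory afterwards
```

Without the cache, every iteration would pay the parse cost again, which is precisely the MapReduce behavior that makes iterative workloads slow on disk-based engines.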
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives: "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• ...
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
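As a hedged illustration of the first three modes above, these are sketch spark-submit invocations (host names, ports, and the application jar are placeholders):

```shell
spark-submit --master local[4]          --class MyApp app.jar   # 1. Local, 4 worker threads
spark-submit --master spark://host:7077 --class MyApp app.jar   # 2. Standalone cluster
spark-submit --master mesos://host:5050 --class MyApp app.jar   # 3. Apache Mesos
```

The application code is identical in all cases; only the `--master` URL changes.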
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• "Guavus (http://www.guavus.com/) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

Component | Hadoop ecosystem | Spark ecosystem
Storage | HDFS | Tachyon
Resource management | YARN | Mesos
Tools | Pig | Spark native API
Tools | Hive | Spark SQL
Tools | Mahout | MLlib
Tools | Storm | Spark Streaming
Tools | Giraph | GraphX
Tools | HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
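A hedged illustration of that conciseness: the classic word count in the native Scala API, where `sc` is an existing SparkContext and the input path is a placeholder:

```scala
// Word count in a few lines; contrast with the dozens of lines
// a Java MapReduce implementation typically needs.
val counts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // emit (word, 1) pairs
  .reduceByKey(_ + _)                    // sum counts per word
counts.take(10).foreach(println)
```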
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
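A minimal sketch of that Hive compatibility, using the Spark 1.2-era API; the `users` table and its columns are hypothetical, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)   // reuses the Hive metastore, formats, and UDFs
val adults = hiveCtx.sql("SELECT name, age FROM users WHERE age > 21")
// the result is an RDD of Rows, so SQL and RDD operations mix freely
adults.map(row => row.getString(0)).take(5).foreach(println)
```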
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions, as a service in the cloud or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
19
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
20
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org/
• Hive: http://hive.apache.org/
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org/
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org/
• Scrunch: http://crunch.apache.org/scrunch.html
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:

1st generation (MapReduce): batch
2nd generation (Tez): batch, interactive
3rd generation (Spark): batch, interactive, near-real-time
4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org/
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User-Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org/
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org/
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org/) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
License | Open source, Apache 2.0, version 2.x | Open source, Apache 2.0, version 0.x | Open source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
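A sketch of point 1, assuming nothing beyond Spark itself: the per-record logic of a typical MapReduce word count maps directly onto flatMap/reduceByKey, so the function bodies can be reused (the input path is a placeholder and `sc` is an existing SparkContext):

```scala
// Former Mapper.map body, now a plain function
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

// Former Reducer.reduce body, now a plain associative function
def reducer(a: Int, b: Int): Int = a + b

val result = sc.textFile("hdfs:///tmp/input").flatMap(mapper).reduceByKey(reducer)
```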
33
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
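The "-x spark" option mentioned above leaves existing scripts unchanged; only the launch command differs (the script name is a placeholder):

```shell
pig -x mapreduce wordcount.pig   # today: runs on MapReduce
pig -x spark     wordcount.pig   # same script, with Spark as the execution engine
```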
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
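The switch is a single session-level setting, as noted above; this is a sketch of a Hive CLI session, where the `employees` table is a placeholder:

```
hive> set hive.execution.engine=spark;
hive> SELECT dept, COUNT(*) FROM employees GROUP BY dept;  -- now executes as a Spark job
```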
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3 Integration
Services and open source tools that integrate with Spark, by category:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• "Getting Started with Apache Spark and Cassandra": http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
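A minimal sketch against the connector's documented entry points; the keyspace, table, and column names are placeholders, and it assumes the spark-cassandra-connector jar on the classpath plus a reachable Cassandra cluster:

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

// Read a Cassandra table as an RDD and project one column
val users = sc.cassandraTable("my_keyspace", "users")
users.map(row => row.getString("name")).take(10).foreach(println)

// Write an RDD of tuples back to the same table
sc.parallelize(Seq(("alice", 30)))
  .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age"))
```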
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra in Spark, and to store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
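A sketch of the receiver-based integration from the Spark 1.x guide; the ZooKeeper address, consumer group, and topic are placeholders, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// (topic -> number of receiver threads)
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
stream.map(_._2).count().print()   // message count per 10-second batch
ssc.start()
ssc.awaitTermination()
```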
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
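A minimal sketch of that schema inference, using Spark 1.2-era names; `people.json` is a placeholder file of one JSON object per line, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")   // schema inferred automatically, no DDL
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()
```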
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
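The three bullets above can be sketched in a few lines, using the Spark 1.2-era names (`parquetFile` / `saveAsParquetFile`); the paths are placeholders and an existing SQLContext `sqlContext` is assumed:

```scala
// Import relational data from a Parquet file as a SchemaRDD
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// Run SQL over the imported data
val counts = sqlContext.sql("SELECT COUNT(*) FROM events")

// Write the result back out as Parquet
counts.saveAsParquetFile("hdfs:///data/event_counts.parquet")
```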
58
3 Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos references
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity + 
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
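The Data << RAM point above can be made concrete with a toy sketch (plain Python, not Spark): once parsed data is cached in memory, later passes over it cost no re-parsing. The parse function and line data here are hypothetical stand-ins.

```python
# Toy illustration of why in-memory caching pays off when data fits in RAM.
parse_calls = 0

def parse(line):
    """Pretend this is an expensive parse of a raw input line."""
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3", "4"]

# Without caching: every pass over the data re-parses it (disk-oriented model).
total_no_cache = sum(parse(l) for l in raw) + max(parse(l) for l in raw)  # 8 parses

# With caching: parse once, keep the parsed values in memory, reuse them.
cached = [parse(l) for l in raw]          # one parsing pass (4 parses)
total_cached = sum(cached) + max(cached)  # later passes hit memory only

print(parse_calls)  # 12 parses total: 8 uncached + 4 for the cache
```

Both totals are identical (14); only the amount of repeated work differs, which is the trade-off the slide describes.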
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark:
• Amazon S3
• httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS
• httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
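The options above all come down to the same idea: the storage backend is pluggable, and the processing code does not care which one sits behind a path. Spark resolves this through the Hadoop FileSystem API; the toy dispatcher below (hypothetical backends and scheme names) only illustrates the "bring your own storage" pattern.

```python
# Toy "bring your own storage" dispatcher: pick a backend from the URI scheme.
def make_reader(backends):
    def read(uri):
        scheme, _, path = uri.partition("://")
        if scheme not in backends:
            raise ValueError("no backend for scheme: " + scheme)
        return backends[scheme](path)
    return read

# Hypothetical stand-ins for an in-memory FS, an object store, and HDFS.
backends = {
    "memory": lambda path: "bytes-from-in-memory-store:" + path,
    "s3":     lambda path: "bytes-from-object-store:" + path,
    "hdfs":   lambda path: "bytes-from-hdfs:" + path,
}

read = make_reader(backends)
print(read("s3://bucket/data.csv"))  # bytes-from-object-store:bucket/data.csv
```

The application logic (the caller of `read`) is unchanged whichever backend serves the path, which is the property the slide relies on.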
75
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
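Across the deployment options above, the application code stays the same; in practice what changes is essentially the master URL handed to spark-submit. The sketch below only builds (never runs) such command lines, using Spark's documented master-URL formats; the host names and app file are hypothetical.

```python
# Illustrative helper: the deployment choice is mostly a --master URL.
def spark_submit(master, app="my_app.py"):
    """Build (not execute) a spark-submit command line for a cluster manager."""
    return ["spark-submit", "--master", master, app]

local      = spark_submit("local[4]")           # local mode, 4 worker threads
standalone = spark_submit("spark://host:7077")  # Spark standalone cluster
mesos      = spark_submit("mesos://host:5050")  # Apache Mesos
yarn       = spark_submit("yarn-cluster")       # Hadoop YARN (Spark 1.x syntax)

print(standalone)  # ['spark-submit', '--master', 'spark://host:7077', 'my_app.py']
```

The same `my_app.py` is submitted in every case, which is what makes Spark "agnostic to the underlying infrastructure for clustering".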
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, allowing much more concise lambda expressions that get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
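The conciseness claim above is easiest to see on word count, the canonical example. This is a plain-Python sketch of the same flatMap / map / reduceByKey shape of Spark's API, runnable without a Spark installation; the input line is made up.

```python
# Word count in the flatMap -> map -> reduceByKey shape of Spark's API,
# expressed with Python stdlib pieces instead of RDDs.
from collections import Counter
from itertools import chain

lines = ["to be or not to be"]

words  = chain.from_iterable(l.split() for l in lines)  # flatMap(_.split(" "))
pairs  = ((w, 1) for w in words)                        # map(w => (w, 1))
counts = Counter()                                      # reduceByKey(_ + _)
for w, n in pairs:
    counts[w] += n

print(counts["to"], counts["be"])  # 2 2
```

The whole pipeline fits in a few lines, which is the contrast the slide draws with the "assembly code" verbosity of hand-written Java MapReduce.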
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
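The "mix and match SQL and imperative APIs" point can be illustrated without a Spark install: below, the stdlib sqlite3 module stands in for the SQL engine, a declarative query does the filtering, and ordinary code post-processes the rows. The table and rows are made up for the sketch.

```python
# Mixing declarative SQL with imperative post-processing over the same data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INT)")
con.executemany("INSERT INTO people VALUES (?, ?)",
                [("ann", 34), ("bob", 19), ("cep", 52)])

# Declarative step: SQL does the filtering...
adults = con.execute("SELECT name, age FROM people WHERE age >= 21").fetchall()

# ...imperative step: plain code transforms the result set.
names = sorted(name.upper() for name, _ in adults)
print(names)  # ['ANN', 'CEP']
```

In Spark SQL the same interplay happens over distributed data: a SQL query produces an RDD (a DataFrame from 1.3 on) that the programmatic API can keep transforming.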
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
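The table's central distinction, record-at-a-time versus mini batches, can be sketched in a few lines of plain Python (no Storm or Spark involved; the event list and batch size are made up): the processing function runs once per record in one model and once per small batch in the other.

```python
# Record-at-a-time (Storm-like): one invocation of fn per incoming record.
def record_at_a_time(stream, fn):
    return [fn([r]) for r in stream]

# Micro-batching (Spark-Streaming-like): group records into small batches
# and hand each batch to the ordinary batch API.
def micro_batches(stream, fn, batch_size):
    out = []
    for i in range(0, len(stream), batch_size):
        out.append(fn(stream[i:i + batch_size]))  # one invocation per batch
    return out

events = [1, 2, 3, 4, 5, 6]
print(record_at_a_time(events, sum))  # [1, 2, 3, 4, 5, 6]
print(micro_batches(events, sum, 3))  # [6, 15]
```

Per-record dispatch gives lower latency; batching amortizes overhead and lets the same batch code serve streaming, which is why Spark Streaming integrates with the Core Spark API.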
96
GraphX
'GraphX' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
20
5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data httpwikiapacheorghadoopWordCount
• Pig httppigapacheorg
• Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
• Cascading httpwwwcascadingorg
• Scalding: a Scala API for Cascading httptwittercomscalding
• Crunch httpcrunchapacheorg
• Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink. The capabilities grew with each generation:
• 1st generation: batch
• 2nd generation: batch, interactive
• 3rd generation: batch, interactive, near-real-time
• 4th generation: batch, interactive, real-time, iterative
24
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
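Two properties make RDDs fast in practice: transformations are recorded lazily until an action runs them, and cache() keeps a computed result in memory for reuse. The toy class below is a sketch of those two ideas only, not Spark's actual implementation; all names are made up.

```python
# Toy model of the RDD idea: lazy transformations, actions, and caching.
class ToyRDD:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops
        self.cached, self._memo = False, None

    def map(self, f):                 # transformation: only records f
        return ToyRDD(self.data, self.ops + (f,))

    def cache(self):                  # mark result for in-memory reuse
        self.cached = True
        return self

    def collect(self):                # action: actually runs the pipeline
        if self.cached and self._memo is not None:
            return self._memo
        out = list(self.data)
        for f in self.ops:
            out = [f(x) for x in out]
        if self.cached:
            self._memo = out
        return out

calls = 0
def times10(x):
    global calls
    calls += 1
    return x * 10

rdd = ToyRDD([1, 2, 3]).map(times10).cache()
print(calls)          # 0 -- lazy: building the pipeline ran nothing
print(rdd.collect())  # [10, 20, 30] -- the first action triggers the work
rdd.collect()         # second action is served from the cache
print(calls)          # 3 -- times10 was never re-run
```

Real RDDs add partitioning, lineage-based fault tolerance, and distributed execution on top of this skeleton.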
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala] user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python] user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | To data types and data sources is the same | To data types and data sources is the same | To data types and data sources is the same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | — | — | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
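The "reuse your mapper and reducer" point can be sketched in plain Python (the drivers below are hypothetical stand-ins, not Hadoop or Spark code): the same two functions feed both a MapReduce-style shuffle-and-reduce pass and a Spark-style flatMap/reduceByKey pipeline.

```python
# One mapper and one reducer, reused by two different execution styles.
from collections import defaultdict
from functools import reduce

def mapper(line):                 # emits (key, value) pairs
    return [(w, 1) for w in line.split()]

def reducer(a, b):                # combines two values for one key
    return a + b

lines = ["a b a", "b"]

# MapReduce-style driver: shuffle pairs by key, then reduce each group.
groups = defaultdict(list)
for line in lines:
    for k, v in mapper(line):
        groups[k].append(v)
mr_result = {k: reduce(reducer, vs) for k, vs in groups.items()}

# Spark-style driver: flatMap the mapper, then fold with the same reducer.
pairs = [kv for line in lines for kv in mapper(line)]
spark_result = {}
for k, v in pairs:
    spark_result[k] = reducer(spark_result[k], v) if k in spark_result else v

print(mr_result == spark_result, mr_result["a"])  # True 2
```

Because the business logic lives in `mapper` and `reducer`, only the driver changes during migration, which is what makes the transition incremental.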
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: open), Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Spark integrates with open source tools across the Hadoop ecosystem, by service category:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory (httphortonworkscomblogddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase, without the need of using the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3 httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
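What "automatically infer the schema" means can be sketched with the stdlib json module: scan the records and collect a field-to-type map. Spark SQL's real inference is richer (nested structures, type widening across records); the records below are made up.

```python
# Toy schema inference over JSON lines: field name -> observed type.
import json

records = ['{"name": "ann", "age": 34}',
           '{"name": "bob", "age": 19, "city": "LA"}']

schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note that the `city` field appears in only one record and is still picked up, which is why no up-front DDL is needed: the schema is the union of what the data contains.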
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
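The benefit of a columnar format like Parquet can be sketched without Parquet itself: with a row layout, a query that touches one column still walks every full row, while a columnar layout reads only that column's values. The counters below stand in for I/O; the dataset is made up.

```python
# Row layout vs columnar layout for a scan of the 'age' column only.
rows = [("ann", 34, "LA"), ("bob", 19, "SF"), ("cep", 52, "NY")]

# Row layout: scanning 'age' still touches every cell of every row.
row_cells_read = sum(len(r) for r in rows)            # 3 rows x 3 fields = 9

# Columnar layout: the same scan reads just the 'age' column's cells.
columns = {"name": [r[0] for r in rows],
           "age":  [r[1] for r in rows],
           "city": [r[2] for r in rows]}
col_cells_read = len(columns["age"])                  # 3

print(row_cells_read, col_cells_read)  # 9 3
```

Real Parquet adds per-column compression and encoding on top of this layout, which compounds the saving for wide tables and analytical queries.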
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at, rather than choosing one over the other.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
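The cluster-manager choice surfaces in spark-submit's --master flag. A configuration sketch with placeholder hosts and a hypothetical app.py (Spark 1.x-era syntax):

```shell
# Same application, different cluster managers (hosts/paths are placeholders).
spark-submit --master local[4]           app.py   # local threads, no cluster
spark-submit --master spark://host:7077  app.py   # Spark standalone
spark-submit --master mesos://host:5050  app.py   # Apache Mesos
spark-submit --master yarn-cluster       app.py   # Hadoop YARN (requires a Hadoop cluster)
```

Only the master URL changes; the application code stays the same across deployments.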
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives

Component          Hadoop Ecosystem   Spark Ecosystem
Storage            HDFS               Tachyon
Resource Manager   YARN               Mesos
Tools              Pig                Spark native API
                   Hive               Spark SQL
                   Mahout             MLlib
                   Storm              Spark Streaming
                   Giraph             GraphX
                   HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                      Storm                           Spark Streaming
Processing model              Record at a time                Mini batches
Latency                       Sub-second                      Few seconds
Fault tolerance               At least once                   Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available                   Core Spark API
Supported languages           Any programming language        Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
21
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
22
1 Evolution of Programming APIsbull MapReduce in Java is like assembly code of Big
Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
24
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User-Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
25
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                   Tez                        Spark
License           Open Source Apache 2.0,     Open Source Apache 2.0,    Open Source Apache 2.0,
                  version 2.x                 version 0.x                version 1.x
Processing model  On-disk (disk-based         On-disk, Batch,            In-memory, On-disk, Batch,
                  parallelization), Batch     Interactive                Interactive, Streaming
                                                                         (near real-time)
Written in        Java                        Java                       Scala
API               [Java, Python, Scala],      Java, [ISV/Engine/Tool     [Scala, Java, Python],
                  user-facing                 builder]                   user-facing
Libraries         None, separate tools        None                       [Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                      Tez                          Spark
Installation      Bound to Hadoop                Bound to Hadoop              Isn't bound to Hadoop
Ease of use       Difficult to program, needs    Difficult to program; no     Easy to program, no need for
                  abstractions; no interactive   interactive mode except      abstractions; interactive mode
                  mode except Hive, Pig          Hive, Pig
Compatibility     Same for data types and        Same                         Same
                  data sources
YARN integration  YARN application               Ground-up YARN application   Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce                  Tez                        Spark
Deployment   YARN                       YARN                       [Standalone, YARN, SIMR, Mesos, …]
Performance                                                        Good performance when data fits into
                                                                   memory; performance degradation otherwise
Security     More features and          More features and          Still in its infancy;
             projects                   projects                   partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 110)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
bull YARN: Yet Another Resource Negotiator (an implicit nod to Mesos, the original resource negotiator)
bull Integration is still improving; see the open YARN-related issues in the Spark JIRA (query: project = SPARK AND summary ~ yarn AND status = OPEN, ordered by priority): https://issues.apache.org/jira/
bull Some of the open issues are critical ones
bull Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
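A minimal spark-submit invocation against YARN, using the Spark 1.x master syntax (the class and jar names are placeholders):

```shell
# Submit to YARN in cluster mode (the driver runs inside the YARN cluster)
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```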
52
3 Integration
bull Spark SQL provides built-in support for Hive tables:
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
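Querying an existing Hive table from Spark SQL can be sketched with a HiveContext (Spark 1.2-era API; hive-site.xml must be on the classpath, and the table name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveQuery"))
// HiveContext reads hive-site.xml and talks to the Hive metastore
val hiveCtx = new HiveContext(sc)
// Query a Hive table; the result comes back as an RDD of Rows
val sales = hiveCtx.sql("SELECT product, SUM(amount) FROM sales GROUP BY product")
sales.collect().foreach(println)
```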
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
bull Spark Streaming integrates natively with Kafka; see the Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
bull Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
bull 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
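A minimal receiver-based Kafka stream, per the integration guide (a sketch on the Spark 1.2-era API; the ZooKeeper address, consumer group, and topic name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
// Subscribe to the "events" topic with one receiver thread
val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("events" -> 1))
  .map(_._2) // drop the Kafka message key, keep the payload
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()
```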
55
3 Integration
bull Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
bull Spark Streaming integrates natively with Flume; there are two approaches:
bull Approach 1: Flume-style push-based approach
bull Approach 2 (experimental): pull-based approach using a custom sink
bull Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
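The push-based approach (Approach 1) can be sketched like this, assuming the spark-streaming-flume artifact is on the classpath and a Flume agent is configured with an Avro sink pointing at the given host and port:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(new SparkConf().setAppName("FlumeIngest"), Seconds(10))
// Spark acts as an Avro receiver that the Flume agent pushes events to
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 44444)
flumeStream.map(e => new String(e.event.getBody.array())).print()
ssc.start()
ssc.awaitTermination()
```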
56
3 Integration
bull Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame
bull An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
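Schema inference over JSON can be sketched as follows (Spark 1.2-era SQLContext API; the file path and field names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JsonSql"))
val sqlCtx = new SQLContext(sc)
// Infer the schema directly from the JSON records: no DDL needed
val people = sqlCtx.jsonFile("people.json")
people.printSchema()
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```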
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
bull Built-in support in Spark SQL allows you to:
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
bull An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
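Round-tripping structured data through Parquet might look like this (a sketch on the 1.2-era API; paths are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("ParquetDemo"))
val sqlCtx = new SQLContext(sc)
// Load some structured data (here JSON) and persist it as Parquet
val events = sqlCtx.jsonFile("events.json")
events.saveAsParquetFile("events.parquet")
// Read the Parquet files back; the schema is preserved in the files
val stored = sqlCtx.parquetFile("events.parquet")
stored.registerTempTable("events")
sqlCtx.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
```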
58
3 Integration
bull spark-avro: a library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
bull Problem:
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result:
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
bull Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
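The elasticsearch-hadoop Spark support can be sketched as follows (assuming the elasticsearch-hadoop jar and a local Elasticsearch node; the index/type names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs/esRDD to RDDs and SparkContext

val conf = new SparkConf().setAppName("EsDemo")
  .set("es.nodes", "localhost") // where Elasticsearch is running
val sc = new SparkContext(conf)
// Any RDD whose elements translate to documents can be indexed
val docs = sc.makeRDD(Seq(
  Map("title" -> "Spark and ES", "views" -> 10),
  Map("title" -> "Hadoop and ES", "views" -> 5)))
docs.saveToEs("articles/posts")
// An index can also be read back as an RDD of (id, document) pairs
val fromEs = sc.esRDD("articles/posts")
println(fromEs.count())
```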
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity +
bull Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
bull Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
bull Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer:
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
bull Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (object store):
bull https://spark.apache.org/docs/latest/storage-openstack-swift.html
bull https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
bull Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
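The same application can target different cluster managers just by changing the master URL (host names, ports, and the jar name below are placeholders):

```shell
# Local mode with 4 worker threads
spark-submit --master local[4] my-app.jar
# Standalone Spark cluster
spark-submit --master spark://master-host:7077 my-app.jar
# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my-app.jar
# Hadoop YARN (Spark 1.x syntax)
spark-submit --master yarn-cluster my-app.jar
```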
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component → Spark ecosystem counterpart
bull HDFS → Tachyon
bull YARN → Mesos
Tools:
bull Pig → Spark native API
bull Hive → Spark SQL
bull Mahout → MLlib
bull Storm → Spark Streaming
bull Giraph → GraphX
bull HUE → Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
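In Spark 1.x, RDD blocks can be kept off-heap in Tachyon via the OFF_HEAP storage level (a sketch; the Tachyon master URL and input path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("TachyonDemo")
  // Tell Spark where the Tachyon master lives (Spark 1.x setting)
  .set("spark.tachyonStore.url", "tachyon://master:19998")
val sc = new SparkContext(conf)
// OFF_HEAP stores the RDD blocks in Tachyon instead of the executor heap
val data = sc.textFile("hdfs:///logs").persist(StorageLevel.OFF_HEAP)
println(data.count())
```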
89
bull Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as the data center "OS":
bull Share a datacenter between multiple cluster computing apps; provide new abstractions and services
bull Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
bull 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark native API in Scala, Java, and Python
bull Interactive shells in Scala and Python
bull Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
bull ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
bull 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
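The interactive shell makes exploration a few lines of Scala (a sketch; the log file path is illustrative):

```scala
// In spark-shell, `sc` is already provided
val logs = sc.textFile("access.log")
// Keep only error lines and cache them in memory for repeated queries
val errors = logs.filter(_.contains("ERROR")).cache()
println(errors.count())
errors.take(5).foreach(println)
```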
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
bull Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
22
1 Evolution of Programming APIs
bull MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
23
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink:
bull 1st generation (MapReduce): batch
bull 2nd generation (Tez): batch, interactive
bull 3rd generation (Spark): batch, interactive, near-real-time
bull 4th generation (Flink): batch, interactive, real-time, iterative
24
1 Evolution
bull This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
bull Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
bull There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics
25
1 Evolution
bull Tez: Hindi for "speed"
bull This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
bull Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
26
1 Evolution
bull 'Spark' for lightning-fast speed
bull This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
bull Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
bull The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
27
1 Evolution: Apache Flink
bull Flink: German for "nimble, swift, speedy"
bull This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
bull Apache Flink (http://flink.apache.org) offers:
bull Batch and streaming in the same system
bull Beyond DAGs (cyclic operator graphs)
bull Powerful, expressive APIs
bull Inside-the-system iterations
bull Full Hadoop compatibility
bull Automatic, language-independent optimizer
bull 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
License | Open source, Apache 2.0; version 2.x | Open source, Apache 2.0; version 0.x | Open source, Apache 2.0; version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near-real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None; separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
bull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
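The first option, reusing mapper and reducer logic, can be sketched as plain functions handed to Spark's operators (word count as the running example; the Spark calls are shown in comments, and the bottom of the block is a local sanity check that needs no cluster):

```scala
// Mapper and reducer logic as ordinary, testable functions
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

def reducer(a: Int, b: Int): Int = a + b

// In Spark, the same functions plug straight into the RDD operators:
//   sc.textFile("hdfs:///input").flatMap(mapper).reduceByKey(reducer)
//     .saveAsTextFile("hdfs:///output")

// Local sanity check without a cluster
val counts = Seq("to be or", "not to be")
  .flatMap(mapper)
  .groupBy(_._1)
  .mapValues(_.map(_._2).reduce(reducer))
println(counts("to")) // 2
```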
33
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
bull Run Pig with the "-x spark" option for an easy migration without development effort
bull Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
bull Leverage new Spark-specific operators in Pig, such as Cache
bull Still leverage many existing Pig UDF libraries
bull Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
bull Fix outstanding issues and address additional Spark functionality through the community
bull 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
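With the "-x spark" option above, an existing script would be launched like this (the script name is a placeholder, and availability depends on the still-open PIG-4059 work):

```shell
# Run an unmodified Pig script on the Spark execution engine
pig -x spark wordcount.pig
```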
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
bull Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
bull Performance benefits, especially for Hive queries involving multiple reducer stages
bull Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
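Switching an existing Hive session to the Spark engine is a one-line setting (the table and query below are illustrative):

```sql
-- Choose Spark instead of MapReduce or Tez for this session
set hive.execution.engine=spark;
-- Existing queries then run unchanged on the new engine
SELECT product, COUNT(*) FROM sales GROUP BY product;
```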
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (expected in Sqoop 2)
bull Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMSs to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memoryThis allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 30 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector. httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (the "yet another" is an implicit reference to Mesos as an earlier resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some open issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883). httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
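The schema-inference idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Spark SQL's actual implementation; the infer_schema helper and its "widen conflicting types to string" rule are made up for the sketch:

```python
import json

def infer_type(value):
    """Map a JSON value to a coarse schema type (illustrative only)."""
    if isinstance(value, bool):
        return "boolean"          # check bool before int: bool is an int subclass
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"

def infer_schema(json_lines):
    """Merge the fields seen across all records into one schema."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            t = infer_type(value)
            if key in schema and schema[key] != t:
                schema[key] = "string"   # conflicting types widen to string
            else:
                schema[key] = t
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28, "city": "LA"}',
]
print(infer_schema(lines))  # {'name': 'string', 'age': 'integer', 'city': 'string'}
```

Note how the "city" field, present in only one record, still ends up in the merged schema, which is the behavior that makes schemaless JSON queryable.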
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrative example of integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
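Why a columnar format helps analytical queries can be sketched with plain Python lists. This is a conceptual illustration only; Parquet's actual on-disk format adds row groups, encodings and compression on top of the column-per-field idea:

```python
# Hypothetical records: (name, age, city) rows.
rows = [("alice", 34, "LA"), ("bob", 28, "SF"), ("carol", 41, "LA")]

# Row layout keeps each record together; a columnar layout (the Parquet
# idea) keeps each column's values together instead.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# An aggregate over one column touches only that column's values,
# skipping "name" and "city" entirely. That selective scan is where
# columnar formats win for analytical queries.
avg_age = sum(columns["age"]) / len(columns["age"])
```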
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
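The "content translated into documents" requirement amounts to mapping each record to a JSON object. A minimal plain-Python sketch of that translation step (the to_document helper is hypothetical; elasticsearch-hadoop handles the actual indexing into the cluster):

```python
import json

# Hypothetical RDD contents: plain (name, age) tuples.
records = [("alice", 34), ("bob", 28)]

def to_document(record):
    """Translate one record into an indexable document (a JSON object)."""
    name, age = record
    return {"name": name, "age": age}

# One JSON document per record is what gets shipped to Elasticsearch.
docs = [json.dumps(to_document(r)) for r in records]
```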
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
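The Data << RAM point can be illustrated with a plain-Python sketch of what caching parsed data buys on repeated passes. This imitates the effect of RDD.cache(); it is not Spark code, and the tiny JSON dataset is made up for the sketch:

```python
import json

# Hypothetical raw input: small JSON lines that fit comfortably in memory.
raw = ['{"v": %d}' % i for i in range(5)]

parse_count = 0

def parse(line):
    global parse_count
    parse_count += 1
    return json.loads(line)

# Without caching, every pass over the data pays the parse cost again.
total = sum(parse(l)["v"] for l in raw)
maximum = max(parse(l)["v"] for l in raw)
uncached_parses = parse_count   # two passes over five records: 10 parses

# Caching the parsed records (what RDD.cache() enables in Spark) lets
# every later pass reuse the in-memory result.
parse_count = 0
cached = [parse(l) for l in raw]            # parsed once, kept in RAM
total = sum(r["v"] for r in cached)
maximum = max(r["v"] for r in cached)
cached_parses = parse_count                 # one pass: 5 parses
```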
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution.
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
xPatterns
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
BlueData
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
Guavus
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
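The conciseness argument can be illustrated outside Spark with plain Python, where local lists stand in for distributed collections. This is only a stylistic sketch of loop-based versus lambda-based code, not Spark API usage:

```python
# Loop-based style: explicit and verbose.
evens_squared = []
for n in range(10):
    if n % 2 == 0:
        evens_squared.append(n * n)

# Concise functional style: the flavor that lambdas bring to Spark's
# APIs (think filter(...) followed by map(...) over an RDD).
squares = list(map(lambda n: n * n,
                   filter(lambda n: n % 2 == 0, range(10))))

assert squares == evens_squared
```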
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
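The mix-and-match workflow (a declarative SQL step followed by imperative post-processing) can be sketched with Python's built-in sqlite3 standing in for Spark SQL. The hits table, its values and the latency threshold are made up for the illustration:

```python
import sqlite3

# Hypothetical page-hit data: (url, response time in ms).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("/a", 150), ("/b", 340), ("/a", 90)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT url, AVG(ms) AS avg_ms FROM hits GROUP BY url ORDER BY url"
).fetchall()

# Imperative step: post-process the result set in ordinary code.
slow = [url for url, avg_ms in rows if avg_ms > 100]
```

The same back-and-forth is what Spark SQL enables at cluster scale: run SQL over structured data, then hand the result to programmatic transformations (or to MLlib) in the same application.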
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
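The mini-batch model in the comparison can be sketched in a few lines of plain Python. This is a conceptual illustration of how Spark Streaming discretizes a stream into small batches, not the actual DStream implementation (which batches by time interval rather than by count):

```python
def mini_batches(events, batch_size):
    """Discretize a stream of events into small batches, the way
    Spark Streaming turns a live stream into a sequence of RDDs."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:              # flush the final partial batch
        yield batch

# Record-at-a-time (Storm-style) processing handles each event alone,
# giving sub-second latency; mini-batching trades a few seconds of
# latency for batch throughput and per-batch exactly-once semantics.
batches = list(mini_batches(range(7), 3))
```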
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
23
1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007 MapReduce v1 was the only choice as a compute model (Execution Engine) on Hadoop Now we have in addition to MapReduce v2 Tez Spark and Flink
bull Batch bull Batchbull Interactive
bull Batchbull Interactivebull Near-Real
time
bull Batchbull Interactivebull Real-Timebull Iterative
bull 1st Generation
bull 2nd
Generationbull 3rd
Generationbull 4th
Generation
24
1 Evolutionbull This is how Hadoop MapReduce is branding itself ldquoA YARN-based
system for parallel processing of large data sets httphadoopapacheorg
bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job
bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics
25
1 Evolutionbull Tez Hindi for ldquospeedrdquobull This is how Apache Tez is branding itself ldquoThe Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data It is currently built atop YARNrdquo
Source httptezapacheorg
bull Apachetrade Tez is an extensible framework for building high performance batch and interactive data processing applications coordinated by YARN in Apache Hadoop
26
1 Evolution bull lsquoSparkrsquo for lightning fast speedbull This is how Apache Spark is branding itself ldquoApache Sparktrade is a fast and general engine for large-scale data processingrdquo httpssparkapacheorg
bull Apache Spark is a general purpose cluster computing framework its execution model supports wide variety of use cases batch interactive near-real time
bull The rapid in-memory processing of resilient distributed datasets (RDDs) is the ldquocore capabilityrdquo of Apache Spark
27
1 Evolution Apache Flinkbull Flink German for ldquonimble swift speedyrdquobull This is how Apache Flink is branding itself ldquoFast and
reliable large-scale data processing enginerdquobull Apache Flink httpflinkapacheorg offers
bull Batch and Streaming in the same systembull Beyond DAGs (Cyclic operator graphs)bull Powerful expressive APIsbull Inside-the-system iterationsbull Full Hadoop compatibility bull Automatic language independent optimizer
bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs SparkCriteria
License Open SourceApache 20 version 2x
Open SourceApache 20 version 0x
Open SourceApache 20 version 1x
Processing Model
On-Disk (Disk- based parallelization) Batch
On-Disk Batch Interactive
In-Memory On-Disk Batch Interactive Streaming (Near Real-Time)
Language written in
Java Java Scala
API [Java Python Scala] User-Facing
Java[ ISVEngineTool builder]
[Scala Java Python] User-Facing
Libraries None separate tools None [Spark Core Spark Streaming Spark SQL MLlib GraphX]
29
Hadoop MapReduce vs Tez vs SparkCriteria
Installation Bound to Hadoop Bound to Hadoop Isnrsquot bound to Hadoop
Ease of Use Difficult to program needs abstractions
No Interactive mode except Hive Pig
Difficult to program
No Interactive mode except Hive Pig
Easy to program no need of abstractionsInteractive mode
Compatibility
to data types and data sources is same
to data types and data sources is same
to data types and data sources is same
YARN integration
YARN application Ground up YARN application
Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs SparkCriteria
Deployment YARN YARN [Standalone YARN SIMR Mesos hellip]
Performance - Good performance when data fits into memory
- performance degradation otherwise
Security More features and projects
More features and projects
Still in its infancy
Partial support
31
IV Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transitionbull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine1 You can often reuse your mapper and
reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)bull Run Pig with ldquondashx sparkrdquo option for an easy migration
without development effortbull Speed up your existing pig scripts on Spark ( Query
Logical Plan Physical Pan)bull Leverage new Spark specific operators in Pig such as
Cachebull Still leverage many existing Pig UDF librariesbull Pig on Spark Umbrella Jira (Status Passed end-to-end test
cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059bull Fix outstanding issues and address additional Spark functionality
through the community
bull lsquoPig on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 110)
bull New alternative to using MapReduce or Tez hivegt set hiveexecutionengine=sparkbull Help existing Hive applications running on
MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
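Along the lines of the HBaseTest.scala example cited above, a minimal read sketch (the table name and configuration are illustrative, and an HBase cluster is assumed to be reachable):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))

// Point the standard Hadoop TableInputFormat at an HBase table (placeholder name).
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// newAPIHadoopRDD turns any Hadoop InputFormat into an RDD of (key, value) pairs.
val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

println(s"Rows in table: ${hBaseRDD.count()}")
```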
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
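A minimal sketch with the DataStax Spark Cassandra Connector on the classpath (the host, keyspace and table names are illustrative):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Connection host, keyspace and table are placeholders.
val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Expose a Cassandra table as a Spark RDD ...
val words = sc.cassandraTable("test_ks", "words")
println(words.count())

// ... and write an RDD of tuples back to Cassandra.
sc.parallelize(Seq(("spark", 10), ("hadoop", 7)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```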
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
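A short sketch of the Hive integration (assuming an existing SparkContext `sc` and a Hive metastore; the table and columns are illustrative):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext reads the Hive metastore, so existing tables are queryable as-is.
val hiveCtx = new HiveContext(sc)

// The result is a SchemaRDD (renamed DataFrame from Spark 1.3).
val frequent = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")

frequent.collect().foreach(println)
```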
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
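A minimal receiver-based sketch from the integration guide's pattern (ZooKeeper address, consumer group and topic map are placeholders; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// 10-second micro-batches over a Kafka topic.
val ssc = new StreamingContext(sc, Seconds(10))

val messages = KafkaUtils
  .createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))
  .map(_._2)               // drop the key, keep the message payload

messages.count().print()   // messages received per batch

ssc.start()
ssc.awaitTermination()
```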
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
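A small sketch of the schema-inference workflow (the file path is illustrative; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// The schema is inferred automatically from the JSON records -- no DDL needed.
val people = sqlCtx.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Register and query like any other table.
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect()
```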
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
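In the same spirit as that example, a self-contained sketch that round-trips a dataset through Parquet (paths are illustrative; an existing SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
val people = sqlCtx.jsonFile("hdfs:///data/people.json")

// Write the SchemaRDD out in columnar Parquet format ...
people.saveAsParquetFile("hdfs:///data/people.parquet")

// ... and read it back, schema preserved, ready for SQL.
val parquetPeople = sqlCtx.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlCtx.sql("SELECT name FROM parquet_people").collect()
```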
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
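A minimal sketch with elasticsearch-hadoop on the classpath (the node address and the index/type name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs / esRDD to RDDs and SparkContext

val conf = new SparkConf()
  .setAppName("EsExample")
  .set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)

// Index two documents ...
val docs = Seq(Map("title" -> "spark"), Map("title" -> "hadoop"))
sc.makeRDD(docs).saveToEs("library/books")

// ... and read the index back as an RDD of documents.
val fromEs = sc.esRDD("library/books")
println(fromEs.count())
```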
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem component and its Spark ecosystem counterpart:
HDFS: Tachyon
YARN: Mesos
Pig: Spark native API
Hive: Spark SQL
Mahout: MLlib
Storm: Spark Streaming
Giraph: GraphX
HUE: Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Resource sharing: YARN: yes; Mesos: yes
Written in: YARN: Java; Mesos: C++
Scheduling: YARN: memory only; Mesos: CPU and memory
Running tasks: YARN: Unix processes; Mesos: Linux container groups
Requests: YARN: specific requests and locality preference; Mesos: more generic, but more coding for writing frameworks
Maturity: YARN: less mature; Mesos: relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Processing model: Storm: record at a time; Spark Streaming: mini batches
Latency: Storm: sub-second; Spark Streaming: few seconds
Fault tolerance (every record processed): Storm: at least once (may be duplicates); Spark Streaming: exactly once
Batch framework integration: Storm: not available; Spark Streaming: Core Spark API
Supported languages: Storm: any programming language; Spark Streaming: Scala, Java, Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
24
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
25
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
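A tiny sketch of that core capability (the numbers are illustrative; an existing SparkContext `sc` is assumed):

```scala
// Build an RDD, transform it, and ask Spark to keep the result in memory.
val evens = sc.parallelize(1 to 1000000)
  .filter(_ % 2 == 0)
  .cache()                 // mark the RDD for in-memory reuse

println(evens.count())     // first action computes and caches the RDD
println(evens.sum())       // later actions reuse the in-memory data
```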
27
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | Hadoop MapReduce                          | Tez                                  | Spark
License             | Open Source, Apache 2.0, version 2.x      | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive        | In-memory, on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java                                      | Java                                 | Scala
API                 | [Java, Python, Scala] User-facing         | [Java] ISV/Engine/Tool builder       | [Scala, Java, Python] User-facing
Libraries           | None, separate tools                      | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria         | Hadoop MapReduce                                                              | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                               | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig  | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                          | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                              | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark (continued)

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark" http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
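As a sketch of that migration path (the table and query below are hypothetical, and this assumes a Hive build with Spark configured), switching engines is a per-session setting rather than a code change:

```sql
-- Hypothetical session: same HiveQL, different execution engine
set hive.execution.engine=spark;   -- run the query on Spark
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
set hive.execution.engine=mr;      -- fall back to classic MapReduce
```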
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services and the open source tools that integrate with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration: HBase
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration: Cassandra
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration: Cassandra
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration: MongoDB
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration: MongoDB
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration: Neo4j
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving (open SPARK JIRA issues mentioning YARN): https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration: Flume
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration: JSON
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
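A minimal sketch of the "no DDL" point (records and path are invented; the commented lines are the Spark 1.2/1.3-era API, while the runnable part only builds the self-describing input that Spark infers a schema from):

```python
import json
import os
import tempfile

# Build a small JSON-lines dataset: one self-describing object per line
records = [{"name": "alice", "age": 34}, {"name": "bob", "age": 28}]
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# In Spark SQL there is no DDL step; the schema comes from the data itself:
#   people = sqlContext.jsonFile(path)    # SchemaRDD (DataFrame in Spark 1.3)
#   people.registerTempTable("people")
#   sqlContext.sql("SELECT name FROM people WHERE age > 30")

# The field names Spark would discover for this dataset:
inferred_fields = sorted({key for r in records for key in r})
print(inferred_fields)
```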
57
3. Integration: Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration: Avro
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration: Hue
• Hue is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN, references
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
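Since the storage choice above surfaces in Spark mostly as a URI scheme, swapping file systems is typically a one-line change. A sketch (bucket names, hosts and paths below are made up, and each scheme assumes the matching Hadoop connector is on the classpath):

```
sc.textFile("s3n://my-bucket/logs/")          # Amazon S3
sc.textFile("maprfs:///datasets/input")       # MapR-FS
sc.textFile("swift://container.spark/input")  # OpenStack Swift
```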
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
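From the application's point of view, the deployments above differ mainly in the --master URL handed to spark-submit. A sketch (hostnames, ports and the app.py script are made up):

```shell
spark-submit --master local[4]          app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.py   # standalone cluster
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn-client       app.py   # with Hadoop: YARN
```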
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution.
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE (DataStax Enterprise)
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• "Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop ecosystem | Spark ecosystem
File system      | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook, ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
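In practice, the "without any code change" claim boils down to the filesystem URI. A sketch (hostnames and paths are made up; 19998 is Tachyon's usual master port):

```
sc.textFile("hdfs://namenode:9000/input")    # read from HDFS
sc.textFile("tachyon://master:19998/input")  # same job, memory-speed storage
```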
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
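To illustrate the conciseness point, the same lambda style carries over between Spark's APIs and plain Python (the commented line is a hypothetical equivalent Spark chain; the runnable part needs no cluster):

```python
# Equivalent Spark pipeline (sketch):
#   sc.parallelize(range(10)).filter(lambda x: x % 2 == 0) \
#     .map(lambda x: x * x).collect()

# Plain Python with the same filter/map lambdas:
squares = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, range(10))))
print(squares)  # [0, 4, 16, 36, 64]
```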
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing Model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
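The mini-batch model in the comparison above can be seen in a short Spark Streaming sketch (the socket source is hypothetical; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))        // 2-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)    // word counts within each 2-second batch
  .print()
ssc.start()
ssc.awaitTermination()
```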
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
25
1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
26
1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria            | Hadoop MapReduce                            | Tez                                  | Spark
License             | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing Model    | On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive          | In-Memory, On-Disk, Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala], User-Facing          | Java, [ISV/Engine/Tool builder]      | [Scala, Java, Python], User-Facing
Libraries           | None, separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria         | Hadoop MapReduce                                                                | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                 | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | to data types and data sources is same                                          | to data types and data sources is same                     | to data types and data sources is same
YARN integration | YARN application                                                                | Ground-up YARN application                                 | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark:
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
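A sketch of point 1, reusing word-count mapper and reducer logic as plain Scala functions inside Spark (assumes an existing SparkContext `sc`; the path is hypothetical):

```scala
// Logic that was once a Hadoop Mapper and Reducer, as plain functions:
def mapper(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1))

def reducer(a: Int, b: Int): Int = a + b

// Called directly from Spark transformations:
val counts = sc.textFile("hdfs:///logs")   // hypothetical input
  .flatMap(mapper)
  .reduceByKey(reducer)
```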
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the Hadoop ecosystem, service by service:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
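A sketch of the storage-agnostic read and write paths mentioned above (all URIs are hypothetical; assumes an existing SparkContext `sc`):

```scala
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events")  // HDFS
val fromS3    = sc.textFile("s3n://my-bucket/events")            // Amazon S3
val fromLocal = sc.textFile("file:///tmp/events.log")            // local FS

// The same API writes back to any supported store:
fromHdfs.union(fromS3).union(fromLocal)
  .saveAsTextFile("hdfs://namenode:8020/data/merged")
```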
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore. Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
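Following the HBaseTest.scala pattern, a minimal newAPIHadoopRDD sketch (the table name is hypothetical; assumes an existing SparkContext `sc` and HBase client jars on the classpath):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table

val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hbaseRDD.count())   // number of rows scanned from the table
```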
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
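A sketch of the Spark Cassandra Connector API described above (keyspace, table, and column names are hypothetical; assumes an existing SparkContext `sc` configured with a Cassandra host):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as a Spark RDD:
val words = sc.cassandraTable("test_ks", "words")
println(words.first())

// Write a Spark RDD back to a Cassandra table:
sc.parallelize(Seq(("spark", 10), ("cassandra", 5)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```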
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental)
  • GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • PART 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
  • PART 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
  • PART 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some open issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
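The Hive support above can be sketched with the HiveContext from the Spark documentation (the sample table and file follow the docs' kv1.txt example; assumes an existing SparkContext `sc` and a Hive-enabled build):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql(
  "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL and return RDDs of Rows:
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
```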
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
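A sketch of the receiver-based Kafka integration from the guide above (ZooKeeper address, consumer group, and topic are hypothetical; Spark 1.2-era API, assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

stream.map(_._2)   // keep the message payload, drop the key
  .count()
  .print()         // messages received per 2-second batch
ssc.start()
```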
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
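Schema inference on JSON, sketched with the 1.2-era SQLContext API (the file name is hypothetical; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val people = sqlContext.jsonFile("people.json")  // schema inferred automatically
people.printSchema()                             // no DDL was needed
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```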
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
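The Parquet read and write path, sketched with the same 1.2-era API (file names are hypothetical; assumes an existing SparkContext `sc`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val events = sqlContext.jsonFile("events.json")   // any SchemaRDD works here
events.saveAsParquetFile("events.parquet")        // write the RDD out as Parquet

val parquetEvents = sqlContext.parquetFile("events.parquet")
parquetEvents.registerTempTable("events")         // queryable again via SQL
```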
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
  Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
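The deployment choice surfaces in application code only as the master URL; a minimal sketch (host names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick exactly one master URL, matching the chosen deployment:
val conf = new SparkConf().setAppName("demo")
  .setMaster("local[4]")             // local mode with 4 threads
  // .setMaster("spark://host:7077") // standalone cluster
  // .setMaster("mesos://host:5050") // Apache Mesos
  // .setMaster("yarn-client")       // Hadoop YARN
val sc = new SparkContext(conf)
```

In practice the master is usually left out of the code and passed via spark-submit's --master flag instead, keeping the application deployment-agnostic.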
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
          | Hadoop Ecosystem | Spark Ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook, ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  - Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  - Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and Memory
Running tasks     Unix processes                              Linux Container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
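To make the conciseness point concrete, here is a hypothetical word count in plain Python using the same flatMap/map/reduce idiom that the native API expresses over RDDs (no Spark required; the data is made up for illustration):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to be is to do"]

# "flatMap": split each line into words, then flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# "map" to (word, 1) plus "reduceByKey" collapses to a per-word count
counts = Counter(words)

print(counts["to"])  # 4
```

With PySpark, the equivalent pipeline would chain `flatMap`, `map` and `reduceByKey` calls on an RDD; in Java 8, the lambdas make that chain read almost identically to the Scala version.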
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
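The "mini batches" row is the crux of the comparison: Spark Streaming discretizes a stream into small time-sliced batches instead of handling each record individually like Storm. A toy sketch of that batching discipline in plain Python (the timestamps and interval are invented for illustration; this is not the Spark Streaming API):

```python
def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into time-sliced batches,
    the way Spark Streaming discretizes a stream into small batches."""
    batches = {}
    for ts, value in records:
        # Each record falls into the batch covering its time slice
        batch_id = int(ts // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d', 'e']]
```

Each emitted batch corresponds to one RDD in Spark Streaming's DStream abstraction; end-to-end latency is therefore bounded below by the batch interval, which is why the table lists "few seconds" rather than "sub-second".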
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your own deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
26
1. Evolution: Apache Spark
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
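To make the RDD "core capability" concrete: an RDD records a lineage of lazy transformations that only execute when an action is called, and the computed result can be cached in memory for reuse. A hypothetical toy model of that idea in plain Python (not the real Spark API):

```python
class ToyRDD:
    """Minimal stand-in for an RDD: lazy transformations, cached results."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []   # lineage: ordered list of transformations
        self._cache = None

    def map(self, fn):
        # Transformation: just extend the lineage; compute nothing yet
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the lineage once, then keep the result in memory
        if self._cache is None:
            out = list(self._data)
            for kind, fn in self._ops:
                out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
            self._cache = out
        return self._cache

rdd = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

In real Spark the lineage additionally provides resilience: a lost partition can be recomputed from its ancestors rather than restored from replicas.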
27
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  - Batch and streaming in the same system
  - Beyond DAGs (cyclic operator graphs)
  - Powerful, expressive APIs
  - Inside-the-system iterations
  - Full Hadoop compatibility
  - Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
28
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model: On-disk (disk-based parallelization), batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Written in: Java | Java | Scala
API: [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility: Same for all three with respect to data types and data sources
YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark

Criteria: MapReduce | Tez | Spark
Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance: – | – | Good performance when data fits into memory; performance degradation otherwise
Security: More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as their execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
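Point 1 above can be sketched in plain Python: a classic word-count mapper and reducer, written once, slot directly into a Spark-style flatMap/group/reduce pipeline (simulated here without Spark; the data is made up):

```python
from itertools import groupby

# Classic MapReduce-style functions, written once
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

# Reused in a Spark-style pipeline: flatMap -> shuffle -> reduce per key
lines = ["spark and hadoop", "spark with hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]   # flatMap(mapper)
pairs.sort(key=lambda kv: kv[0])                        # the "shuffle"
result = dict(reducer(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'and': 1, 'hadoop': 2, 'spark': 2, 'with': 1}
```

In actual Spark code the same `mapper` and `reducer` would be passed to `flatMap` and a grouped reduction on an RDD, which is the reuse path the slide describes.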
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  - Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  - Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Spark integrates with open source tools across the following service categories:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  - Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  - Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  - Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  - Import relational data from Hive tables
  - Run SQL queries over imported data
  - Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  - Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
  - Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  - Approach 1: Flume-style push-based approach
  - Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at the JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
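The idea behind automatic schema inference can be illustrated with a toy Python sketch (this is not Spark's implementation): scan the JSON records, collect each field's name and value type, and merge them into one schema. A real engine also reconciles conflicting types; this toy simply keeps the last type seen.

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of field names -> value type names."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "age": 29, "city": "LA"}']
print(infer_schema(records))
# {'name': 'str', 'age': 'int', 'city': 'str'}
```

This is what lets Spark SQL skip the DDL step: the schema falls out of the data itself.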
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  - Import relational data from Parquet files
  - Run SQL queries over imported data
  - Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
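Why a columnar format helps: a query that touches only some columns can skip the rest entirely. A minimal, hypothetical illustration of row vs column layout in plain Python (Parquet layers encodings, compression and metadata on top of this basic idea):

```python
# Row-oriented: each record stored together, so a query scans whole rows
rows = [
    {"name": "alice", "age": 34, "city": "LA"},
    {"name": "bob",   "age": 29, "city": "SF"},
]

# Column-oriented: one contiguous list per column, like a Parquet column chunk
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query touching only "age" reads just that column, skipping the others
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 31.5
```

Column pruning like this, plus per-column compression, is why columnar formats shine for analytical scans.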
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  - Problem:
    - Various inbound data sets
    - Data layout can change without notice
    - New data sets can be added without notice
  - Result:
    - Leverage Spark to dynamically split the data
    - Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  - Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  - Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Tez + Spark
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Tez + Spark
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  - Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  - MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (Object Store):
  - https://spark.apache.org/docs/latest/storage-openstack-swift.html
  - https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  - Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  - Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
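Whichever option is chosen, the cluster manager is selected at submit time through the master URL passed to spark-submit. A hypothetical sketch for Spark 1.x (the host names and application file are made up for illustration):

```shell
# Local mode: run with 4 worker threads on one machine
spark-submit --master local[4] my_app.py

# Standalone cluster manager
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 my_app.py

# Hadoop YARN (HADOOP_CONF_DIR must point at the cluster configuration)
spark-submit --master yarn-cluster my_app.py
```

The application code stays the same across all four; only the master URL, and therefore the deployment, changes.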
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
HDFS | Tachyon
YARN | Mesos
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
• Mesos as Data Center "OS"
• Share datacenter between multiple cluster computing apps Provide new abstractions and services
• Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS…
• 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
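The fine-grained sharing idea above can be sketched in plain Python (a toy model, not the real Mesos offer protocol; all names and numbers are illustrative):

```python
# Toy model of Mesos-style fine-grained sharing: each scheduling round,
# the framework is offered whatever CPUs are currently idle, so a Spark
# job's parallelism can grow as other workloads release resources.

TOTAL_CPUS = 8

def fine_grained_rounds(other_usage_per_round):
    """Each round, the Spark job runs one task per idle CPU it is offered."""
    tasks_per_round = []
    for other in other_usage_per_round:
        idle = TOTAL_CPUS - other
        tasks_per_round.append(idle)  # one task per offered CPU
    return tasks_per_round

def coarse_grained_rounds(other_usage_per_round, reserved=4):
    """A static reservation runs the same task count regardless of idle CPUs."""
    return [reserved for _ in other_usage_per_round]

# Another tenant ramps down over 4 rounds, freeing CPUs.
other = [6, 4, 2, 0]
print(fine_grained_rounds(other))    # [2, 4, 6, 8] - picks up idle CPUs
print(coarse_grained_rounds(other))  # [4, 4, 4, 4] - stuck at reservation
```

The fine-grained job ends each round using every idle CPU, which is the "take advantage of idle resources" claim in the bullet above.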
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark Native API in Scala Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
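The "mix and match SQL and imperative APIs" idea can be sketched with Python's standard-library sqlite3 standing in for Spark SQL (an analogy only — Spark SQL runs over RDDs/SchemaRDDs, not SQLite; the table and data are made up):

```python
import sqlite3

# Analogy for Spark SQL's "unify SQL and sophisticated analysis" idea,
# using stdlib sqlite3 in place of a SchemaRDD (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("/home", 120), ("/home", 80), ("/docs", 300)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT page, AVG(ms) FROM hits GROUP BY page ORDER BY page").fetchall()

# Imperative step: post-process the SQL result with ordinary Python.
slow_pages = [page for page, avg_ms in rows if avg_ms > 150]
print(rows)        # [('/docs', 300.0), ('/home', 100.0)]
print(slow_pages)  # ['/docs']
```

The point is the shape of the workflow — declarative aggregation followed by imperative logic over the result — not the storage engine.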
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is File System Agnostic Bring Your Own Storage
2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in Non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
27
1 Evolution Apache Flink
• Flink German for "nimble swift speedy"
• This is how Apache Flink is branding itself "Fast and reliable large-scale data processing engine"
• Apache Flink httpflinkapacheorg offers
• Batch and Streaming in the same system
• Beyond DAGs (Cyclic operator graphs)
• Powerful expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic language-independent optimizer
• 'Flink' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
28
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0 version 2.x | Open Source Apache 2.0 version 0.x | Open Source Apache 2.0 version 1.x
Processing Model | On-Disk (Disk-based parallelization) Batch | On-Disk Batch Interactive | In-Memory On-Disk Batch Interactive Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java Python Scala] User-Facing | Java [ISV / Engine / Tool builder] | [Scala Java Python] User-Facing
Libraries | None separate tools | None | [Spark Core Spark Streaming Spark SQL MLlib GraphX]
29
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program needs abstractions No interactive mode except Hive Pig | Difficult to program No interactive mode except Hive Pig | Easy to program no need of abstractions Interactive mode
Compatibility | to data types and data sources is same | to data types and data sources is same | to data types and data sources is same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
30
Hadoop MapReduce vs Tez vs Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone YARN SIMR Mesos …]
Performance | – | – | Good performance when data fits into memory performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy Partial support
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine
1 You can often reuse your mapper and
reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
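The reuse point above can be sketched in plain Python: the same mapper and reducer functions drive both a MapReduce-style pipeline and a Spark-style flatMap/reduceByKey chain (the helpers are illustrative stand-ins, not the actual Spark API):

```python
from collections import defaultdict

# Plain-Python sketch of translating MapReduce word count to Spark style.
# The mapper and reducer functions are reused unchanged in both versions.

def mapper(line):                      # MapReduce-style map: emit (word, 1)
    return [(w, 1) for w in line.split()]

def reducer(a, b):                     # MapReduce-style reduce: sum counts
    return a + b

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                 # map + shuffle: group values by key
        for k, v in mapper(line):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():       # reduce per key
        acc = vs[0]
        for v in vs[1:]:
            acc = reducer(acc, v)
        out[k] = acc
    return out

def run_spark_style(lines):
    # rdd.flatMap(mapper).reduceByKey(reducer) expressed over a plain list
    pairs = [kv for line in lines for kv in mapper(line)]   # flatMap
    out = {}
    for k, v in pairs:                                      # reduceByKey
        out[k] = reducer(out[k], v) if k in out else v
    return out

lines = ["spark or hadoop", "spark and hadoop"]
assert run_mapreduce(lines) == run_spark_style(lines)
print(run_spark_style(lines))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

Both pipelines produce identical counts, which is why existing mapper/reducer functions can often be called from Spark with little change.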
33
2 Transition
3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
• Run Pig with "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query Logical Plan Physical Plan)
• Leverage new Spark-specific operators in Pig such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status Passed end-to-end test cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez hive> set hive.execution.engine=spark
• Help existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
• Performance benefits especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark February 11 2015 Szehon Ho Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka from SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
Mahout (Expected in Mahout 1.0)
• Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell Interactive REPL shell for Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout scala and spark bindings Dmitriy Lyubimov April 2014 httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014 httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) - MapReduce Spark H2O Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Hadoop ecosystem services with open source tool integration for Spark (logos shown on the original slide)
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
• Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
• Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra Integration using different approaches httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra httptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector
• MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
• Interesting blog on Using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
• YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones
• Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 Spark-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark Integration is work in progress in 2015 to address new use cases
• Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume There are two approaches to this
• Approach 1 Flume-style Push-based Approach
• Approach 2 (Experimental) Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
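The schema-inference idea can be sketched in plain Python with the standard json module (illustrative only — Spark SQL's actual inference handles nesting, nulls and type widening far more thoroughly):

```python
import json

# Plain-Python sketch of the idea behind Spark SQL's JSON schema
# inference (not the Spark implementation): scan JSON records and
# derive a field -> type mapping instead of writing DDL up front.

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            # Crude conflict resolution: widen to 'str' on mismatched types.
            if field in schema and schema[field] != t:
                schema[field] = "str"
            else:
                schema[field] = t
    return schema

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 28, "city": "Chicago"}',
]
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note the second record contributes a field the first lacks; the inferred schema is the union of all observed fields, which is the "no more DDL" convenience.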
57
3 Integration
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
• Built-in support in Spark SQL allows to
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
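The benefit of a columnar layout can be sketched in plain Python (illustrative only, not the Parquet file format itself): an analytical query over one column touches far fewer values when data is stored column-wise.

```python
# Plain-Python sketch of row vs columnar layout for analytics.
rows = [
    {"page": "/home", "ms": 120, "user": "a"},
    {"page": "/docs", "ms": 300, "user": "b"},
    {"page": "/home", "ms": 80,  "user": "c"},
]

# Row layout: computing avg(ms) scans every field of every record.
row_values_read = sum(len(r) for r in rows)          # 9 fields touched

# Columnar layout: the same query reads only the 'ms' column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_values_read = len(columns["ms"])                 # 3 fields touched

avg_ms = sum(columns["ms"]) / len(columns["ms"])
print(row_values_read, col_values_read)              # 9 3
```

Column-at-a-time storage also compresses better (similar values sit together), which is the other reason formats like Parquet suit Spark SQL workloads.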
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark Use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark Support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity
• Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or… HDFS caching)
• Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity
• Data >> RAM Processing huge data volumes much bigger than cluster RAM Tez might be better since it is more "stream oriented" has a more mature shuffling implementation closer YARN integration
• Data << RAM Since Spark can cache parsed data in memory it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
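The "Data << RAM" point can be sketched in plain Python (a toy model, not Spark code): caching the parsed data pays the parse cost once, however many passes follow.

```python
# Toy model of why in-memory caching helps multi-pass jobs: count how
# many times the "expensive" parse step runs with and without a cache.

parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1          # stand-in for expensive parsing / disk I/O
    return int(line)

raw = ["1", "2", "3", "4"]

# Without caching: every pass re-parses the raw input (the way
# disk-oriented pipelines re-read data between stages).
parse_calls = 0
for _ in range(3):
    total = sum(parse(l) for l in raw)
uncached = parse_calls        # 12 parses for 3 passes over 4 records

# With caching: parse once, keep results in memory (the rdd.cache() idea).
parse_calls = 0
cached = [parse(l) for l in raw]
for _ in range(3):
    total = sum(cached)
with_cache = parse_calls      # 4 parses regardless of pass count
print(uncached, with_cache)   # 12 4
```

The gap grows linearly with the number of passes, which is why iterative and interactive workloads benefit most — provided the cached data actually fits in cluster memory.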
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS Hadoop Distributed File System Your 'Big Data' use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark
• Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
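In practice, the deployment options above boil down to the --master URL handed to spark-submit. Here is a minimal plain-Python sketch of the conventional master URL formats (host names, ports, and the jar name are illustrative placeholders, not values from this deck):

```python
# Conventional --master values for spark-submit; host/port values are
# illustrative defaults, not something to copy verbatim.
masters = {
    "local":      "local[*]",                  # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos":      "mesos://mesos-host:5050",
    "yarn":       "yarn-cluster",              # reads cluster info from HADOOP_CONF_DIR
}

def submit_command(mode, app_jar):
    """Build an illustrative spark-submit command line for a deployment mode."""
    return f"spark-submit --master {masters[mode]} {app_jar}"

cmd = submit_command("standalone", "my-app.jar")
# cmd == "spark-submit --master spark://master-host:7077 my-app.jar"
```

Only the master URL changes between deployments; the application code itself stays the same.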
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus (http://www.guavus.com) operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Component              Hadoop Ecosystem    Spark Ecosystem
Storage                HDFS                Tachyon
Resource Management    YARN                Mesos
Tools                  Pig                 Spark native API
                       Hive                Spark SQL
                       Mahout              MLlib
                       Storm               Spark Streaming
                       Giraph              GraphX
                       HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria           YARN                     Mesos
Resource sharing   Yes                      Yes
Written in         Java                     C++
Scheduling         Memory only              CPU and Memory
Running tasks      Unix processes           Linux Container groups
Requests           Specific requests and    More generic, but more coding
                   locality preference      for writing frameworks
Maturity           Less mature              Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                    Storm                      Spark Streaming
Processing model            Record at a time           Mini batches
Latency                     Sub-second                 Few seconds
Fault tolerance (every      At least once              Exactly once
record processed)           (may be duplicates)
Batch framework             Not available              Core Spark API
integration
Supported languages         Any programming language   Scala, Java, Python
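The "mini batches" row above is the key architectural difference: Spark Streaming discretizes a stream into small batches instead of handling one record at a time. A plain-Python sketch of that grouping idea (this is not Spark Streaming's actual API, just an illustration):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a record-at-a-time stream into mini batches, the way
    Spark Streaming discretizes a live stream into small RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # each batch is then processed as one unit

# Record-at-a-time (Storm-style) handles events one by one, giving
# sub-second latency; mini-batching adds a few seconds of latency but
# amortizes scheduling overhead and reuses the core Spark batch API.
events = ["e1", "e2", "e3", "e4", "e5"]
batches = list(micro_batches(events, 2))
# batches == [["e1", "e2"], ["e3", "e4"], ["e5"]]
```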
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria      Hadoop MapReduce         Tez                      Spark
License       Open Source,             Open Source,             Open Source,
              Apache 2.0,              Apache 2.0,              Apache 2.0,
              version 2.x              version 0.x              version 1.x
Processing    On-Disk (disk-based      On-Disk, Batch,          In-Memory, On-Disk,
Model         parallelization),        Interactive              Batch, Interactive,
              Batch                                             Streaming (Near Real-Time)
Language      Java                     Java                     Scala
written in
API           [Java, Python, Scala],   Java                     [Scala, Java, Python],
              User-Facing              [ISV/Engine/Tool         User-Facing
                                       builder]
Libraries     None, separate tools     None                     [Spark Core, Spark Streaming,
                                                                Spark SQL, MLlib, GraphX]
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria       Hadoop MapReduce         Tez                      Spark
Installation   Bound to Hadoop          Bound to Hadoop          Isn't bound to Hadoop
Ease of Use    Difficult to program,    Difficult to program;    Easy to program, no need
               needs abstractions;      no interactive mode      of abstractions;
               no interactive mode      except Hive, Pig         interactive mode
               except Hive, Pig
Compatibility  Compatibility to data types and data sources is the same for all three
YARN           YARN application         Ground-up YARN           Spark is moving
integration                             application              towards YARN
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria      Hadoop MapReduce        Tez                     Spark
Deployment    YARN                    YARN                    [Standalone, YARN,
                                                              SIMR, Mesos, …]
Performance   –                       –                       Good performance when data
                                                              fits into memory; performance
                                                              degradation otherwise
Security      More features and       More features and       Still in its infancy;
              projects                projects                partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark", http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
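As a toy illustration of point 1 (reusing existing mapper and reducer functions), here is a plain-Python sketch — not real MapReduce or Spark code — of a word-count mapper/reducer pair driven MapReduce-style; in Spark the very same functions could be reused in a flatMap/reduceByKey pipeline:

```python
from collections import defaultdict

# Mapper and reducer written once, MapReduce-style.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

def word_count(lines):
    """Drive the functions the way a MapReduce job would:
    map phase, shuffle (group by key), reduce phase.
    In Spark the identical functions can be reused, roughly:
    lines.flatMap(mapper).groupByKey() ... then reducer per key."""
    shuffled = defaultdict(list)
    for line in lines:                  # map phase
        for word, one in mapper(line):
            shuffled[word].append(one)  # shuffle: group by key
    return dict(reducer(w, c) for w, c in shuffled.items())  # reduce phase

counts = word_count(["spark and hadoop", "spark"])
# counts == {"spark": 2, "and": 1, "hadoop": 1}
```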
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• "Sqoop2: Support Sqoop on Spark Execution Engine" (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• "Playing with Mahout's Spark Shell": https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark Bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3. Integration
Services from the Hadoop ecosystem that integrate with Spark (each paired with its open source tool):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, with no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
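Since the posts above center on PageRank over a graph, here is a minimal, generic PageRank iteration in plain Python (not GraphX or Neo4j code; the damping factor and iteration count are conventional but arbitrary choices):

```python
def pagerank(links, iterations=20, damping=0.85):
    """links: node -> list of outbound neighbors.
    Classic iterative PageRank; GraphX expresses the same computation
    over RDD-backed vertex and edge collections."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Everyone keeps a (1 - damping) base share...
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        # ...plus contributions split evenly across each node's out-links.
        for node, outs in links.items():
            for neighbor in outs:
                new_rank[neighbor] += damping * rank[node] / len(outs)
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# ranks sum to ~1.0; "c" (pointed to by both "a" and "b") ranks highest
```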
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
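The schema inference described above can be sketched in plain Python — a drastic simplification of what Spark SQL actually does, with made-up field names, but it shows the idea of deriving a schema by scanning records:

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON records,
    a (much simplified) version of Spark SQL's schema inference
    over a JSON dataset."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            # Crude merge rule: fall back to "str" on conflicting types
            # (Spark SQL's real type widening is more sophisticated).
            if field in schema and schema[field] != t:
                schema[field] = "str"
            else:
                schema.setdefault(field, t)
    return schema

schema = infer_schema(['{"name": "spark", "stars": 100}',
                       '{"name": "hadoop", "stars": 90, "tags": ["big-data"]}'])
# schema == {"name": "str", "stars": "int", "tags": "list"}
```

Note how the second record contributes a field ("tags") the first one lacked: the inferred schema is the union over all records, which is exactly why no up-front DDL is needed.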
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
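The row-to-columnar idea behind Parquet can be sketched in a few lines of Python (illustrative only; this is nothing like Parquet's actual encoding, which adds compression, encodings, and nested schemas):

```python
def to_columnar(rows):
    """Pivot row-oriented records into column-oriented storage,
    the core layout idea behind Parquet: all values of one column
    are stored together, so queries touching few columns read less
    data, and same-typed runs compress better."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"id": 1, "tool": "Pig"}, {"id": 2, "tool": "Hive"}]
cols = to_columnar(rows)
# cols == {"id": [1, 2], "tool": ["Pig", "Hive"]}
```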
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3:
  • http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS:
  • https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
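Each of the deployment modes above boils down to a different `--master` URL handed to spark-submit; a sketch with placeholder host names and application jar (host names and app.jar are hypothetical; the ports are the documented defaults):

```shell
spark-submit --master local[4]          app.jar   # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.jar   # standalone cluster manager
spark-submit --master mesos://host:5050 app.jar   # Apache Mesos
spark-submit --master yarn-cluster      app.jar   # Hadoop YARN (Spark 1.x syntax)
```

These are CLI templates rather than a runnable script; the application code itself stays the same across all four modes.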
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

             Hadoop ecosystem    Spark ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini batches
Latency                       Sub-second                 Few seconds
Fault tolerance (every        At least once              Exactly once
record processed)             (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
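The "record at a time" vs. "mini batches" rows of the table can be sketched in plain Python, with no Storm or Spark required (the batch size and the doubling function are made up for illustration; real engines add scheduling, fault tolerance, and state on top of this skeleton):

```python
from typing import Callable, Iterable, List

def record_at_a_time(stream: Iterable[int], process: Callable[[int], int]) -> List[int]:
    """Storm-style: each record is handed to the processor as soon as it arrives."""
    return [process(r) for r in stream]

def mini_batches(stream: Iterable[int], batch_size: int,
                 process_batch: Callable[[List[int]], List[int]]) -> List[int]:
    """Spark Streaming-style: records are grouped into small batches first.
    Waiting for a batch to fill adds latency, but lets the engine reuse the
    same batch machinery it applies to data at rest."""
    out: List[int] = []
    batch: List[int] = []
    for r in stream:
        batch.append(r)
        if len(batch) == batch_size:
            out.extend(process_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        out.extend(process_batch(batch))
    return out

# Both models compute the same result; they differ in when work happens.
a = record_at_a_time(range(7), lambda r: r * 2)
b = mini_batches(range(7), 3, lambda batch: [r * 2 for r in batch])
assert a == b == [0, 2, 4, 6, 8, 10, 12]
```

This is also why Spark Streaming's latency is "few seconds" in the table: it is bounded below by the batch interval, while Storm pays only per-record overhead.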
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                       Spark
Installation      Bound to Hadoop           Bound to Hadoop           Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;     Easy to program, no need
                  needs abstractions;       no interactive mode       of abstractions;
                  no interactive mode       (except via Hive/Pig)     interactive mode
                  (except via Hive/Pig)
Compatibility     Support for data types and data sources is the same across all three
YARN integration  YARN application          Ground-up YARN            Spark is moving towards
                                            application               YARN
30
Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce                Tez                      Spark
Deployment   YARN                     YARN                     Standalone, YARN, SIMR, Mesos, ...
Performance  -                        -                        Good performance when data fits into
                                                               memory; performance degradation otherwise
Security     More features and        More features and        Still in its infancy;
             projects                 projects                 partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. See "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
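Point 1 above (reusing existing mapper and reducer functions unchanged) can be sketched with a toy word count in plain Python; in actual Spark code the same two functions would be handed to `flatMap` and `reduceByKey`:

```python
from functools import reduce
from itertools import groupby

# Existing "MapReduce-style" functions, kept exactly as they were:
def mapper(line):
    """Emit (word, 1) pairs for one input line."""
    return [(w, 1) for w in line.split()]

def reducer(a, b):
    """Sum the counts for one key."""
    return a + b

# Re-driving them with a functional pipeline, the way Spark would:
lines = ["spark and hadoop", "spark or hadoop"]
pairs = [kv for line in lines for kv in mapper(line)]   # ~ lines.flatMap(mapper)
pairs.sort(key=lambda kv: kv[0])                        # ~ shuffle/group by key
counts = {k: reduce(reducer, (v for _, v in grp))       # ~ reduceByKey(reducer)
          for k, grp in groupby(pairs, key=lambda kv: kv[0])}

assert counts == {"and": 1, "hadoop": 2, "or": 1, "spark": 2}
```

The migration cost is in the driver pipeline, not in the per-record logic, which is why mapper/reducer reuse is usually the easy part.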
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
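The engine switch is a single session-level setting; a minimal HiveQL sketch (the `logs` table is hypothetical, and this assumes a Hive build with Spark support):

```sql
-- run this Hive session on Spark instead of MapReduce or Tez
set hive.execution.engine=spark;

-- subsequent queries execute as Spark jobs; the queries themselves are unchanged
SELECT page, COUNT(*) AS hits
FROM logs
GROUP BY page;
```

Because only the execution engine changes, existing HiveQL scripts, UDFs, and metastore definitions carry over as-is.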
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories (the matching open source tools appear as logos on the original slide):
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are then resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra in Spark, and to store Resilient Distributed Datasets (RDDs) from Spark in Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving. Open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL - just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
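What "automatically infer the schema" does can be sketched in plain Python: a single pass over JSON lines collecting each field's observed types (Spark SQL additionally merges and reconciles conflicting types across records, which this toy skips):

```python
import json

def infer_schema(json_lines):
    """Scan every record and record each field's observed Python type names --
    a toy version of the inference pass Spark SQL's JSON reader performs
    before exposing the data as a SchemaRDD/DataFrame."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
assert infer_schema(lines) == {"name": ["str"], "age": ["int"], "city": ["str"]}
```

Note how "city" appears in only one record yet still lands in the schema; this is why no up-front DDL is needed, and why fields absent from a record simply come back as null when queried.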
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
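The core idea behind a columnar format like Parquet can be shown in a few lines of plain Python: each column is stored contiguously, so a query touching one column never has to read the others (toy records; real Parquet adds encodings, compression, and row groups on top):

```python
# Row layout: one record per dict, the shape data usually arrives in.
rows = [
    {"user": "alice", "bytes": 120, "country": "US"},
    {"user": "bob",   "bytes": 310, "country": "FR"},
    {"user": "carol", "bytes": 45,  "country": "US"},
]

# Columnar layout: one list per field, values kept in row order.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# An aggregate over "bytes" now scans a single contiguous list and can
# skip "user" and "country" entirely -- the essence of column pruning.
total = sum(columns["bytes"])
assert total == 475
assert columns["country"] == ["US", "FR", "US"]
```

Same data, same answers; the layout alone is what makes column-pruned analytical scans cheap, which is why Spark SQL pushes column selection down into the Parquet reader.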
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets.
• Data layout can change without notice.
• New data sets can be added without notice.
• Result:
• Leverage Spark to dynamically split the data.
• Leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than having to choose one of them.
(Hadoop ecosystem and Spark ecosystem logos)
65
4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
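The "Data << RAM" point can be illustrated with a toy cache in plain Python: pay the parse cost once, and every later pass over the same dataset runs from memory (the parse function is a made-up stand-in for an expensive deserialize step, loosely analogous to calling rdd.cache() before several actions):

```python
parse_calls = 0

def parse(raw):
    """Stand-in for an expensive parse/deserialize step over raw input."""
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching (MapReduce/Tez-style): each pass re-reads and re-parses.
s1 = sum(parse(raw_data))
s2 = sum(parse(raw_data))
assert parse_calls == 2 and s1 == s2 == 10

# With caching (Spark-style): parse once, then reuse the in-memory result.
parse_calls = 0
cached = parse(raw_data)
s3, s4 = sum(cached), sum(cached)
assert parse_calls == 1 and s3 == s4 == 10
```

The flip side, as the bullets above note, is that once the working set no longer fits in cluster memory the cached copy has to spill or be recomputed, and the advantage shrinks.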
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB file system (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
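The storage choice surfaces in application code only as the input URI. A minimal PySpark-flavored sketch, assuming hypothetical paths, bucket, and host names (running the commented pipeline requires a Spark installation):

```python
# Sketch: Spark's textFile() accepts a URI, so swapping storage back ends
# is a one-line change. All paths and host names below are hypothetical.
def event_log_uri(backend):
    """Return an example input URI for a given storage back end."""
    uris = {
        "local":   "file:///var/log/events.log",      # local file system
        "s3":      "s3n://example-bucket/events/",    # Amazon S3
        "tachyon": "tachyon://master:19998/events",   # in-memory FS
        "swift":   "swift://events.ostack/2015/",     # OpenStack Swift
    }
    return uris[backend]

# With a live SparkContext `sc`:
# rdd = sc.textFile(event_log_uri("s3"))
# print(rdd.count())
```

The rest of the job is identical regardless of which back end the URI points at, which is the sense in which Spark is file-system agnostic.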
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
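In code, the deployment choice is just the master URL handed to SparkConf. A minimal sketch using Spark 1.x master URL formats; the cluster host names are hypothetical:

```python
# The master URL selects the cluster manager; the application code
# itself is unchanged. Host names below are hypothetical.
MASTER_URLS = {
    "local":      "local[*]",             # all cores on one machine
    "standalone": "spark://master:7077",  # Spark standalone cluster
    "mesos":      "mesos://master:5050",  # Apache Mesos
    "yarn":       "yarn-client",          # Hadoop YARN (Spark 1.x form)
}

def make_conf_pairs(mode, app_name="demo"):
    """Key/value pairs you would feed to SparkConf().setAll(...)."""
    return [("spark.master", MASTER_URLS[mode]),
            ("spark.app.name", app_name)]

# With PySpark installed:
# from pyspark import SparkConf, SparkContext
# sc = SparkContext(conf=SparkConf().setAll(make_conf_pairs("standalone")))
```

Switching from, say, standalone to Mesos is a configuration change rather than a code change.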
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component | Hadoop ecosystem | Spark ecosystem
Storage | HDFS | Tachyon
Resource management | YARN | Mesos
Tools | Pig | Spark native API
 | Hive | Spark SQL
 | Mahout | MLlib
 | Storm | Spark Streaming
 | Giraph | GraphX
 | HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding to write frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
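As a rough illustration of the API's concision, a log filter in PySpark is a one-liner around a plain function; the commented pipeline assumes a live SparkContext `sc` and a hypothetical input path:

```python
def is_error(line):
    """Predicate usable both on local data and inside an RDD filter."""
    return "ERROR" in line

# PySpark pipeline (requires a Spark installation; path is hypothetical):
# errors = sc.textFile("hdfs:///logs").filter(is_error).count()

# The same predicate works on plain local data, e.g. in the shell:
sample = ["INFO ok", "ERROR disk full", "WARN slow"]
error_count = sum(1 for line in sample if is_error(line))
```

Because RDD operators take ordinary functions, the same logic runs in the interactive shell on a list and on a cluster-sized dataset unchanged.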
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
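The "mini-batches" row is the key difference: Spark Streaming discretizes a stream into small batches and runs an ordinary Spark job on each, which is why its latency is seconds rather than sub-second. A toy, Spark-free sketch of that idea:

```python
def mini_batches(stream, batch_size):
    """Group an event stream into fixed-size mini-batches, the way Spark
    Streaming groups records into time-sliced RDDs (here by count rather
    than by time, to keep the sketch deterministic)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Each yielded batch would then be processed as one ordinary Spark job,
# whereas Storm hands every individual record to the topology as it arrives.
```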
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
30
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | Standalone, YARN, SIMR, Mesos, …
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
31
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
32
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as their execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
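A hedged sketch of point 1 above: MapReduce-style mapper and reducer logic written as plain functions maps directly onto Spark's flatMap/reduceByKey operators. The function names and input path are illustrative, not from any real project:

```python
# MapReduce-style logic as plain functions (reusable unchanged in Spark).
def mapper(line):
    """Emit (word, 1) pairs, as a WordCount Mapper would."""
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    """Sum counts, as a WordCount Reducer would."""
    return a + b

# The Spark translation just wires them into RDD operators
# (requires a live SparkContext `sc`; path is hypothetical):
# counts = (sc.textFile("hdfs:///input")
#             .flatMap(mapper)
#             .reduceByKey(reducer)
#             .collect())
```

The shuffle-and-group step that MapReduce performs between map and reduce is what reduceByKey provides, so no job-driver boilerplate survives the translation.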
33
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services and the open source tools that integrate with Spark (logos omitted):
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore, e.g., the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
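A minimal sketch of what the Hive round trip looks like in application code, using the Spark 1.2-era Python API; the table name and query are hypothetical, and the commented part requires PySpark built with Hive support:

```python
def top_n_query(table, n):
    """Build the HiveQL query used below (table name is hypothetical)."""
    return "SELECT key, value FROM {0} ORDER BY key LIMIT {1}".format(table, n)

# With PySpark built with Hive support (Spark 1.2 API):
# from pyspark.sql import HiveContext
# hive = HiveContext(sc)                     # reuses the Hive metastore
# rows = hive.sql(top_n_query("src", 10)).collect()
# for row in rows:
#     print(row.key, row.value)
```

The point of the integration is that the same metastore-backed tables are visible from both Hive and Spark, so no export step is needed.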
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
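A hedged sketch of the receiver-based Kafka integration (the Python `KafkaUtils.createStream` API arrived in Spark 1.3; the ZooKeeper host, consumer group, and topic below are hypothetical):

```python
def kafka_topics(topic_names, partitions_per_topic=1):
    """Build the {topic: partition-count} dict that createStream expects."""
    return {t: partitions_per_topic for t in topic_names}

# Receiver-based stream (requires PySpark and a running Kafka/ZooKeeper;
# host names, group id, and topic are hypothetical):
# from pyspark.streaming import StreamingContext
# from pyspark.streaming.kafka import KafkaUtils
# ssc = StreamingContext(sc, 2)                       # 2-second batches
# lines = KafkaUtils.createStream(ssc, "zk:2181", "demo-group",
#                                 kafka_topics(["events"]))
# lines.map(lambda kv: kv[1]).pprint()                # message values only
# ssc.start(); ssc.awaitTermination()
```

Each 2-second batch of Kafka messages becomes one RDD, so the rest of the pipeline is ordinary Spark code.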
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
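As a sketch of what "no more DDL" means in practice: given JSON-lines input, the schema is inferred from the records themselves. The sample records below are made up for illustration, and the commented part uses the Spark 1.2-era `jsonFile` API with a hypothetical path:

```python
import json

# The JSON-lines shape Spark SQL infers a schema from; records may
# have differing fields, and the inferred schema is their union.
sample_lines = [
    '{"name": "alice", "visits": 3}',
    '{"name": "bob", "visits": 7, "city": "LA"}',  # extra field is fine
]
records = [json.loads(line) for line in sample_lines]

# With a SQLContext `sqlContext` (Spark 1.2 API; path is hypothetical):
# people = sqlContext.jsonFile("hdfs:///data/people.json")
# people.printSchema()                 # schema inferred, no DDL written
# people.registerTempTable("people")
# sqlContext.sql("SELECT name FROM people WHERE visits > 5")
```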
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
(Hadoop ecosystem | Spark ecosystem)
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine, interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
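The deployment modes above differ mainly in the master URL handed to spark-submit. A sketch of the flag forms (hostnames, ports and app.py are placeholders, not real endpoints; syntax follows the Spark 1.x era of this talk):

```shell
# Placeholder hosts/ports; app.py stands in for your application
spark-submit --master local[*]            app.py   # local mode, all cores
spark-submit --master spark://host:7077   app.py   # standalone cluster
spark-submit --master mesos://host:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster        app.py   # Hadoop YARN (Spark 1.x flag)
```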
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazon's S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100% open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Components
HDFS | Tachyon
YARN | Mesos
Tools
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
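Tachyon's core idea can be conveyed with a toy in-memory store shared by two frameworks. This is a conceptual stand-in in plain Python, not Tachyon's actual interface; all names are illustrative:

```python
# Toy in-memory "file system" shared by two frameworks, to convey Tachyon's
# core idea: data kept in memory once, readable by Spark and MapReduce alike.
class InMemoryFS:
    def __init__(self):
        self.files = {}                  # path -> bytes, held in RAM
    def write(self, path, data):
        self.files[path] = data
    def read(self, path):
        return self.files[path]          # memory-speed, no disk round-trip

fs = InMemoryFS()
fs.write("/shared/events", b"e1,e2,e3")
spark_view = fs.read("/shared/events")      # a "Spark" job reads it ...
mapreduce_view = fs.read("/shared/events")  # ... and so does "MapReduce"
print(spark_view == mapreduce_view)  # True: one in-memory copy, shared
```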
bull Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
bull Mesos as data center "OS"
bull Share a datacenter between multiple cluster computing apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS ...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 lambda expressions for much more concise code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
96
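The processing-model row of the table can be sketched in plain Python (illustrative only, not the Storm or Spark Streaming APIs): record-at-a-time invokes the user function once per record for the lowest latency, while mini-batching invokes it once per batch for higher throughput:

```python
# Conceptual contrast between record-at-a-time (Storm-style) and
# mini-batch (Spark Streaming-style) processing of a stream.
def record_at_a_time(stream, fn):
    return [fn([r]) for r in stream]          # one invocation per record

def mini_batches(stream, batch_size, fn):
    return [fn(stream[i:i + batch_size])      # one invocation per batch
            for i in range(0, len(stream), batch_size)]

stream = [1, 2, 3, 4, 5, 6]
per_record = record_at_a_time(stream, sum)    # 6 calls, lowest latency
per_batch = mini_batches(stream, 3, sum)      # 2 calls, higher throughput
print(per_record, per_batch)  # [1, 2, 3, 4, 5, 6] [6, 15]
```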
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
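To show the kind of iterative graph computation GraphX targets, here is a minimal PageRank loop in plain Python (this is not GraphX code; the graph and parameters are made up for illustration):

```python
# Minimal PageRank iteration: each node spreads its rank along its out-links,
# with damping factor d. GraphX runs this kind of loop at cluster scale.
def pagerank(links, iterations=50, d=0.85):
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contribs = {n: 0.0 for n in nodes}
        for node, outs in links.items():
            for out in outs:
                contribs[out] += ranks[node] / len(outs)
        ranks = {n: (1 - d) / len(nodes) + d * c for n, c in contribs.items()}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" collects the most rank
```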
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is file-system agnostic Bring your own storage
2 Deployment Spark is cluster-infrastructure agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
31
III Spark with Hadoop
1 Evolution2 Transition3 Integration4 Complementarity5 Key Takeaways
32
2 Transition
bull Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark How-to Translate from MapReduce to Apache Spark
httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
33
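The reuse point above can be sketched in plain Python, with ordinary lists standing in for RDDs (the names are illustrative, not a real Spark or Hadoop API): a classic mapper/reducer pair slots straight into a Spark-style flatMap / groupByKey / map chain:

```python
# Simulates reusing a Hadoop mapper/reducer pair in a Spark-style pipeline.
from itertools import groupby

def mapper(line):                      # classic word-count mapper
    return [(word, 1) for word in line.split()]

def reducer(key, values):              # classic word-count reducer
    return (key, sum(values))

def spark_style_word_count(lines):
    # flatMap(mapper) -> groupByKey -> map(reducer), as chained steps
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])
    return dict(reducer(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

counts = spark_style_word_count(["spark and hadoop", "spark or hadoop"])
print(counts)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```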
2 Transition3 The following tools originally based on Hadoop MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
34
Pig on Spark (Spork)
bull Run Pig with the "-x spark" option for an easy migration without development effort
bull Speed up your existing Pig scripts on Spark (Query Logical Plan Physical Plan)
bull Leverage new Spark-specific operators in Pig such as Cache
bull Still leverage many existing Pig UDF libraries
bull Pig on Spark Umbrella Jira (Status Passed end-to-end test cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059
bull Fix outstanding issues and address additional Spark functionality through the community
bull 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
35
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull New alternative to using MapReduce or Tez hive> set hive.execution.engine=spark;
bull Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
bull Exposes Spark users to a viable feature-rich de facto standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015 httpsissuesapacheorgjirabrowseHIVE-7292
36
Hive on Spark (Currently in Beta Expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
bull Hive on Spark February 11 2015 Szehon Ho Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and Mostapah Mokhtar (Hortonworks) February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
bull Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
bull The next version of Sqoop referred to as Sqoop2 supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira Status Work In Progress) The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from CascadingScalding to Spark a little easier by adding support for Cascading Taps Scalding Sources and the Scalding Fields API in Spark Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunch
bull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 0.11 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased) - MapReduce Spark H2O Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 Integration
Hadoop ecosystem services that integrate with Spark by category (the original slide shows the open source tools for each)
bull Storage / Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector
bull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark
bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
bull YARN Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 Spark-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
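A toy model of the consumer-offset design underlying the Kafka integration (this is not the Kafka API; names and data are made up): a consumer that replays from its last committed offset after a failure sees some records twice, which is exactly the at-least-once behavior noted in the Storm vs Spark Streaming table:

```python
# Broker keeps an append-only log; the consumer tracks a committed offset
# and re-reads from it after a crash -> at-least-once delivery.
log = ["evt1", "evt2", "evt3", "evt4"]      # the partition's append-only log

committed_offset = 0
processed = []

def consume(from_offset, upto):
    global committed_offset
    for offset in range(from_offset, upto):
        processed.append(log[offset])       # process the record
    committed_offset = upto                 # commit only after processing

consume(0, 3)               # processes evt1..evt3, commits offset 3
committed_offset = 2        # pretend the last commit was lost in a crash
consume(committed_offset, len(log))         # replay from offset 2
print(processed)  # evt3 appears twice -> at-least-once semantics
```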
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
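The schema-inference idea can be sketched in plain Python: scan the records and union each field's observed types. This is only a conceptual sketch, not Spark SQL's actual algorithm:

```python
# Infer a field -> type mapping by scanning JSON records (toy version of
# the idea behind Spark SQL's automatic JSON schema inference).
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: "/".join(sorted(t)) for f, t in schema.items()}

records = ['{"name": "spark", "stars": 5}',
           '{"name": "hadoop", "tags": ["batch"]}']
print(infer_schema(records))
# {'name': 'str', 'stars': 'int', 'tags': 'list'}
```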
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
58
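Why a columnar format helps analytics can be shown with a toy example in plain Python: storing by column lets a query read only the columns it needs (Parquet additionally compresses and encodes each column; the data here is made up):

```python
# Row layout vs column layout: a column-oriented store lets an aggregate
# query touch one column instead of every field of every record.
rows = [{"user": "a", "bytes": 10, "country": "US"},
        {"user": "b", "bytes": 20, "country": "FR"},
        {"user": "c", "bytes": 30, "country": "US"}]

# Column layout: each column stored contiguously
columns = {k: [row[k] for row in rows] for k in rows[0]}

# "SELECT sum(bytes)" only needs to read one column:
print(sum(columns["bytes"]))  # 60
```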
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
bull Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
bull Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services"
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics or HDFS caching)
bull The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity +
bull Data >> RAM When processing huge data volumes much bigger than cluster RAM Tez might be better since it is more "stream oriented" has a more mature shuffling implementation and closer YARN integration
bull Data << RAM Since Spark can cache parsed data in memory it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
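The Data << RAM point can be illustrated with plain-Python memoization standing in for rdd.cache() (all names here are illustrative, not Spark's API): when the working set fits in memory, the expensive computation runs once per distinct input and later uses hit the cache:

```python
# Memoization as a stand-in for Spark's in-memory RDD caching: the parse
# runs once per distinct record; repeated "actions" reuse the cached result.
compute_calls = 0

def expensive_parse(record):
    global compute_calls
    compute_calls += 1
    return record.upper()

cache = {}
def cached_parse(record):
    if record not in cache:               # first access materializes the data
        cache[record] = expensive_parse(record)
    return cache[record]                  # later accesses hit the cache

data = ["a", "b", "a", "a", "b"]
results = [cached_parse(r) for r in data]
print(results, compute_calls)  # ['A', 'B', 'A', 'A', 'B'] 2
```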
4 Complementarity
bull Emergence of the 'Smart Execution Engine' layer A smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
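The storage options above differ mostly in the URI scheme a Spark job hands to calls like sc.textFile(). As a hypothetical illustration (not from the talk; scheme strings and hosts are placeholders, and exact forms depend on connector setup and Spark version):

```python
# Hypothetical helper: build the URI a Spark job would pass to sc.textFile().
# The scheme strings are illustrative for the Spark 1.x era.
SCHEMES = {
    "hdfs": "hdfs://{host}/{path}",        # classic Hadoop DFS
    "s3": "s3n://{host}/{path}",           # Amazon S3 (s3n scheme)
    "tachyon": "tachyon://{host}/{path}",  # in-memory file system
    "swift": "swift://{host}/{path}",      # OpenStack Swift object store
    "local": "file:///{path}",             # plain local file system
}

def storage_uri(backend, host="", path=""):
    """Compose a storage URI for the chosen backend."""
    return SCHEMES[backend].format(host=host, path=path)

print(storage_uri("s3", host="my-bucket", path="logs/2015/03/09"))
# -> s3n://my-bucket/logs/2015/03/09
```

The point of the slide is exactly this: the processing code stays the same while the storage backend varies.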
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
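In practice, the deployment choices above mostly surface as different "master" settings passed to spark-submit or SparkConf. A minimal sketch (not from the talk; host names and ports are placeholders, and the master-URL syntax shown is the Spark 1.x style):

```python
# Illustrative master URLs per deployment mode (Spark 1.x era syntax).
MASTERS = {
    "local": "local[*]",                       # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos": "mesos://mesos-host:5050",        # Apache Mesos
    "yarn": "yarn-cluster",                    # Hadoop YARN
}

def submit_command(mode, app="my_app.py"):
    """Compose an illustrative spark-submit command line for a mode."""
    return "spark-submit --master {m} {app}".format(m=MASTERS[mode], app=app)

print(submit_command("mesos"))
# -> spark-submit --master mesos://mesos-host:5050 my_app.py
```

The application code is unchanged across modes; only the master setting (and cluster-side configuration) differs.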
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component:
• HDFS | Tachyon
• YARN | Mesos
Tools:
• Pig | Spark native API
• Hive | Spark SQL
• Mahout | MLlib
• Storm | Spark Streaming
• Giraph | GraphX
• HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
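The conciseness the slide refers to is the lambda-based functional chain. A minimal sketch (not from the talk): the PySpark form is shown in a comment, and plain-Python equivalents run the same chain locally so no cluster is needed:

```python
# The equivalent PySpark pipeline would be roughly:
#   sc.parallelize(nums).filter(lambda n: n % 2 == 0) \
#                       .map(lambda n: n * n) \
#                       .reduce(lambda a, b: a + b)
# Plain Python stand-ins for the same lambda-based chain:
from functools import reduce

nums = range(1, 11)
evens_squared_sum = reduce(
    lambda a, b: a + b,                       # reduce(...)
    map(lambda n: n * n,                      # map(...)
        filter(lambda n: n % 2 == 0, nums)))  # filter(...)

print(evens_squared_sum)  # 4 + 16 + 36 + 64 + 100 = 220
```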
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
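The "record at a time" vs "mini batches" distinction in the table can be sketched in plain Python (an illustration, not from the talk; real Spark Streaming groups records by time interval rather than by count):

```python
# Storm-style: each record is handed to processing as soon as it arrives.
def record_at_a_time(stream, handle):
    for record in stream:
        handle(record)

# Spark-Streaming-style: records are grouped into small batches first;
# each batch is then processed with the normal (batch) Spark API.
def mini_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

print(list(mini_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching is what buys Spark Streaming its exactly-once semantics and batch-API reuse, at the cost of the few seconds of latency shown in the table.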
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
32
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
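Reusing mapper and reducer functions can be sketched as follows (an illustration, not from the talk): the same word-count mapper/reducer pair drives both styles; the Spark wiring is shown in a comment, and a plain-Python simulation of flatMap/shuffle/reduceByKey runs locally:

```python
from collections import defaultdict

def mapper(line):
    # MapReduce map(): emit a (word, 1) pair for each word in the line
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    # MapReduce reduce(): combine two counts for the same key
    return a + b

# In Spark the very same functions would be wired up roughly as:
#   counts = sc.textFile("hdfs://...").flatMap(mapper).reduceByKey(reducer)
# Below, flatMap / shuffle / reduceByKey are simulated locally:
def word_count(lines):
    pairs = [kv for line in lines for kv in mapper(line)]  # flatMap(mapper)
    grouped = defaultdict(list)
    for key, value in pairs:                               # shuffle by key
        grouped[key].append(value)
    result = {}
    for key, values in grouped.items():                    # reduceByKey(reducer)
        acc = values[0]
        for v in values[1:]:
            acc = reducer(acc, v)
        result[key] = acc
    return result

print(word_count(["spark or hadoop", "spark and hadoop"]))
```

The mapper/reducer pair is untouched; only the surrounding driver code changes between the two frameworks.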
33
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (Currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (Currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka SQL to Hadoop) was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service | Open Source Tool (tools were shown as logos in the original slide)
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
49
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
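What "infer the schema of a JSON dataset" means can be shown with a tiny plain-Python sketch (an illustration, not Spark SQL's actual algorithm, which works at scale over distributed files): scan JSON lines and union the fields and value types seen.

```python
import json

def infer_schema(json_lines):
    """Map each field name to the set of value-type names observed for it,
    mimicking (very loosely) Spark SQL's schema inference over JSON lines."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]
print(infer_schema(lines))
```

Records need not share the same fields; the inferred schema is the union, which is why no up-front DDL is required.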
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
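The "columnar" idea behind Parquet can be sketched with in-memory Python structures (an illustration, not the Parquet file format itself):

```python
# Row-oriented layout: one record per entry, as a row store keeps them.
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 29},
]

def to_columnar(rows):
    """Pivot a list of records into one list per column, as a columnar
    format stores them; scanning a single column then touches less data
    and compresses better (similar values sit next to each other)."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)
    return columns

columnar = to_columnar(rows)
print(columnar["age"])  # [34, 29] -- read without touching the "name" values
```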
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity: YARN + Mesos references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
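The rule of thumb above can be written down as code. This is a hypothetical helper, not from the talk; the threshold is invented for illustration, and real engine choice depends on workload shape, not just the data-to-RAM ratio:

```python
def suggest_engine(data_gb, cluster_ram_gb):
    """Toy heuristic: pick a stream-oriented engine when data greatly
    exceeds cluster memory, otherwise favor in-memory caching."""
    ratio = data_gb / float(cluster_ram_gb)
    if ratio > 1.0:
        return "tez"    # Data >> RAM: stream-oriented processing
    return "spark"      # Data fits in memory: cache and iterate

print(suggest_engine(500, 128))  # 'tez'
print(suggest_engine(50, 128))   # 'spark'
```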
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
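The deployment choices above map directly onto the master URL passed to Spark. As a hedged illustration, here is a tiny plain-Python classifier of the documented (2015-era) master-URL forms; the helper function itself is invented for this deck, not part of any Spark API:

```python
def deployment_mode(master):
    # Mirror how Spark's --master URL selects the cluster manager.
    if master.startswith("local"):
        return "local"          # local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"     # Spark's own standalone cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "yarn"
    raise ValueError("unrecognized master URL: " + master)

print(deployment_mode("local[*]"))           # local
print(deployment_mode("mesos://host:5050"))  # mesos
```

Because only the master URL changes, the same application jar can move from a laptop to a standalone, Mesos, or YARN cluster without code changes.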
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
Hadoop ecosystem tool → Spark ecosystem alternative:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
• Criteria: YARN | Mesos
• Resource sharing: Yes | Yes
• Written in: Java | C++
• Scheduling: Memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups (cgroups)
• Requests: Specific requests and locality preference | More generic, but more coding to write frameworks
• Maturity: Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
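The conciseness claim above is easy to demonstrate. Below is a plain-Python sketch of the classic RDD-style word count; the names are ordinary Python (Spark's `flatMap`/`map`/`reduceByKey` chain is emulated with a comprehension and a `Counter`), so no cluster is assumed:

```python
from collections import Counter

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap: each line becomes a sequence of words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: (word, 1) pairs folded into per-word counts
counts = Counter(words)

print(counts["to"])  # 4
```

The same pipeline in Spark's Scala or Python API is a one-liner of chained transformations; the functional shape, not the cluster, is what makes the code short.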
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
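The "mix and match SQL and imperative APIs" point can be made concrete without Spark. As a stand-in, this sketch uses Python's built-in sqlite3 in place of Spark SQL (the table and rows are made up for illustration): the aggregation is declarative SQL, the post-processing is ordinary imperative code over the result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative part: aggregate in SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative part: arbitrary program logic over the query result
heavy_users = [user for user, total in rows if total > 4]
print(heavy_users)  # ['ann', 'bob']
```

In Spark SQL the result of a query is an RDD (a DataFrame from 1.3 on), so exactly this hand-off between a SQL step and a programmatic step happens inside one job.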
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
• Criteria: Storm | Spark Streaming
• Processing model: Record at a time | Mini batches
• Latency: Sub-second | Few seconds
• Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
• Batch framework integration: Not available | Core Spark API
• Supported languages: Any programming language | Scala, Java, Python
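The "mini batches" model in the comparison above can be sketched in a few lines of plain Python: instead of handling each record as it arrives (the Storm model), timestamped records are grouped into fixed-width windows and each batch is processed as a unit. The data and batch width here are illustrative only:

```python
def micro_batches(records, batch_seconds):
    # Assign each (timestamp, value) record to a fixed-width time window,
    # the way Spark Streaming discretizes a stream into RDD mini-batches.
    batches = {}
    for ts, value in records:
        key = int(ts // batch_seconds)
        batches.setdefault(key, []).append(value)
    return [batches[k] for k in sorted(batches)]

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
print(micro_batches(stream, 1.0))  # [['a', 'b'], ['c'], ['d']]
```

Batching is what buys Spark Streaming its exactly-once semantics and reuse of the core batch API, at the cost of a few seconds of latency per window.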
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
33
2. Transition
The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce". Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• A reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Hadoop ecosystem services with Spark integration (the open source tools appear as logos on the original slide): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) would allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without using the Hadoop API, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
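The posts above compute PageRank with Spark's GraphX over Neo4j data. To show the algorithm itself, here is a minimal plain-Python power-iteration PageRank over a toy graph; the graph, damping factor, and iteration count are illustrative, not taken from the articles:

```python
def pagerank(links, damping=0.85, iters=50):
    # links: node -> list of out-neighbors; every node must have out-links.
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node keeps a baseline share, plus damped contributions
        # from the ranks of the nodes linking to it.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

links = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(round(sum(ranks.values()), 6))  # 1.0 (ranks form a distribution)
```

GraphX distributes exactly this iteration across a cluster, with the rank vector and edge list partitioned as RDDs.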
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit nod to Mesos, the earlier resource negotiator).
• Integration is still improving; see the open Spark JIRA issues (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC).
• Some of these issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
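The schema-inference idea above is easy to sketch without Spark: scan the JSON records, record each field's observed type, and widen to a union when types disagree. This stand-alone Python illustration captures the core mechanism only; Spark SQL's real inference additionally merges nested structures and numeric types:

```python
import json

def infer_schema(json_lines):
    # field name -> set of Python type names observed across all records
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(ts) for f, ts in schema.items()}

data = ['{"name": "ann", "age": 34}',
        '{"name": "bob", "age": 36, "tags": ["spark"]}']
print(infer_schema(data))
# {'name': ['str'], 'age': ['int'], 'tags': ['list']}
```

Because the schema falls out of the data, no DDL is needed before querying, which is exactly the convenience the slide describes.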
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: YARN + Mesos references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
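The "Data << RAM" point above boils down to caching: if the parsed data fits in memory, every pass after the first one skips the parse. A plain-Python stand-in for `RDD.cache()` (the parse counter is just instrumentation for this illustration):

```python
parse_calls = {"n": 0}

def parse(raw):
    # Simulated expensive parse step; the counter records how often it runs.
    parse_calls["n"] += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching: every pass re-parses the input (2 parses here)
total_a = sum(parse(raw_data)) + max(parse(raw_data))

# With caching: parse once, reuse the in-memory result for every pass
cached = parse(raw_data)
total_b = sum(cached) + max(cached)

print(parse_calls["n"], total_a == total_b)  # 3 True
```

Spark applies the same trade-off cluster-wide: once an RDD is cached, iterative algorithms (the MLlib workloads, for example) pay the I/O and parsing cost only on the first pass.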
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
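Whichever of these deployments is chosen, the cluster manager is selected solely by the master URL handed to Spark; an illustrative sketch of `spark-submit` command lines (host names and `app.py` are placeholders, not from the deck):

```shell
# Illustrative only: the --master URL alone selects the deployment mode
spark-submit --master local[4] app.py             # local mode, 4 worker threads
spark-submit --master spark://host:7077 app.py    # standalone cluster
spark-submit --master mesos://host:5050 app.py    # Apache Mesos
```

The application code itself does not change between these modes; only the master URL (and any packaging details) differs.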
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component | Hadoop ecosystem | Spark ecosystem
--------- | ---------------- | ---------------
Storage   | HDFS             | Tachyon
Resource management | YARN   | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria         | YARN                                      | Mesos
---------------- | ----------------------------------------- | -----
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
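The flavor of the native API can be sketched with a toy word count in plain Python - the functions below are hypothetical stand-ins for Spark's flatMap/map/reduceByKey, not actual Spark code:

```python
# Toy word count mimicking the shape of Spark's RDD API in plain Python.
# Illustrative only; a real Spark job would start from sc.textFile(...).
lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split()]   # ~ flatMap
pairs = [(w, 1) for w in words]                       # ~ map
counts = {}                                           # ~ reduceByKey
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"])  # 2
```

In the real API the same pipeline is a short chain of transformations, in any of the three supported languages.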
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria          | Storm                    | Spark Streaming
----------------- | ------------------------ | ---------------
Processing model  | Record at a time         | Mini batches
Latency           | Sub-second               | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available  | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
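The "record at a time" vs "mini batches" row of the table can be sketched in plain Python - a toy illustration, not Storm or Spark Streaming code:

```python
# Record-at-a-time (Storm-style) vs mini-batch (Spark Streaming-style),
# sketched with plain Python functions. Illustrative only.
def record_at_a_time(stream, handle):
    for record in stream:        # each record is handled the moment it arrives
        handle(record)

def mini_batches(stream, handle_batch, batch_size=3):
    batch = []
    for record in stream:        # records are buffered into small batches
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)  # one batch = one small Spark job
            batch = []
    if batch:                    # flush the final partial batch
        handle_batch(batch)

seen = []
mini_batches(range(7), seen.append, batch_size=3)
print(seen)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Buffering into batches is what trades a few seconds of latency for the throughput and batch-API integration in the table above.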
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
34
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
35
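The "-x spark" option above is the entire migration step for many scripts; an illustrative command line (assumes a Spork-enabled Pig build, and `wordcount.pig` is a hypothetical script):

```shell
# Run an unchanged Pig script on the Spark engine instead of MapReduce
pig -x spark wordcount.pig
```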
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open; Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
36
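The engine switch quoted above is a per-session Hive setting; an illustrative session (assumes a Hive build with the Spark engine available; the `logs` table is hypothetical):

```sql
-- From the Hive shell; the default engine would otherwise be mr (or tez)
set hive.execution.engine=spark;
SELECT level, COUNT(*) FROM logs GROUP BY level;  -- now executes as a Spark job
```

The query text itself is unchanged; only the execution engine underneath differs.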
Hive on Spark (currently in Beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org/) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org/
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org/
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Open source tools and services that integrate with Spark, by layer:
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some of the open issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
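The schema-inference idea can be sketched in plain Python; this is a toy illustration only, not Spark SQL code (in Spark SQL the work is done for you when loading the JSON dataset):

```python
import json

# Toy sketch of inferring a schema from a JSON dataset (plain Python, not Spark SQL)
records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "age": 28}']
rows = [json.loads(r) for r in records]

# Infer a field -> type mapping by scanning the parsed rows
schema = {}
for row in rows:
    for field, value in row.items():
        schema[field] = type(value).__name__

print(schema)  # {'name': 'str', 'age': 'int'}
```

Spark SQL does this scan distributed over the whole dataset, then exposes the result as queryable columns.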
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
65
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine - an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Component        | Hadoop Ecosystem | Spark Ecosystem
Storage          | HDFS             | Tachyon
Resource manager | YARN             | Mesos
Tools            | Pig              | Spark native API
                 | Hive             | Spark SQL
                 | Mahout           | MLlib
                 | Storm            | Spark Streaming
                 | Giraph           | GraphX
                 | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
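One concrete way Spark uses Tachyon without code changes is the OFF_HEAP storage level, which in Spark 1.x keeps cached blocks in Tachyon rather than in the executor JVM heap. A hedged sketch; the master URL is hypothetical, and the Spark calls are commented out because they require Spark 1.x plus a running Tachyon cluster.

```python
# Configuration key as in Spark 1.2; the URL is hypothetical.
tachyon_conf = {"spark.tachyonStore.url": "tachyon://master:19998"}

# from pyspark import SparkConf, SparkContext, StorageLevel
# conf = SparkConf().setAppName("tachyon-demo")
# for key, value in tachyon_conf.items():
#     conf.set(key, value)
# sc = SparkContext(conf=conf)
# logs = sc.textFile("tachyon://master:19998/logs")
# logs.persist(StorageLevel.OFF_HEAP)   # cached blocks live in Tachyon
# print(logs.count())
```

Because the cached blocks live outside the application's JVM, several Spark jobs can share the same in-memory data.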
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, YARN, HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
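As an illustration of how similar the native API feels across languages, here is the classic word count in Python. The helper functions are plain Python and testable without a cluster; the commented lines show where they plug into the core API (the input path is hypothetical).

```python
def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()

def add(a, b):
    """Associative combiner passed to reduceByKey."""
    return a + b

# from pyspark import SparkContext      # requires a Spark installation
# sc = SparkContext("local[*]", "wordcount")
# counts = (sc.textFile("hdfs:///input/docs")   # hypothetical path
#             .flatMap(tokenize)
#             .map(lambda w: (w, 1))
#             .reduceByKey(add))
# counts.saveAsTextFile("hdfs:///output/counts")
```

The Scala version has line-for-line the same shape, and with Java 8 lambdas the Java version finally does too.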
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
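The first row of the table is the key difference, and it is easy to picture with plain Python: Storm hands each record to the operator as it arrives, while Spark Streaming cuts the stream into mini batches and runs a small batch job on each one. This is a toy simulation of the two models, not an API of either system.

```python
def record_at_a_time(stream, op):
    """Storm-style: apply op to every record individually (lowest latency)."""
    return [op(record) for record in stream]

def mini_batches(stream, batch_size):
    """Spark-Streaming-style: group the stream into small batches."""
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

events = [1, 2, 3, 4, 5]
print(record_at_a_time(events, lambda r: r * 10))   # [10, 20, 30, 40, 50]
print(mini_batches(events, batch_size=2))           # [[1, 2], [3, 4], [5]]
```

Mini-batching is what lets Spark Streaming reuse the core batch API (and get exactly-once semantics), at the price of a few seconds of latency.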
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
36
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
37
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
[Slide: a table of service categories and the open source tools (shown as logos) that integrate with Spark in each one]
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
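A hedged sketch of the first option (Hadoop InputFormats via newAPIHadoopRDD), loosely following the bundled HBaseTest example but in Python. The table and ZooKeeper host names are hypothetical, and the Spark calls are commented out because they need a Spark 1.x installation with the HBase jars on the classpath.

```python
# Minimal read configuration; host and table names are hypothetical.
hbase_conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapreduce.inputtable": "events",
}

# from pyspark import SparkContext
# sc = SparkContext("local[*]", "hbase-read")
# rdd = sc.newAPIHadoopRDD(
#     "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
#     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
#     "org.apache.hadoop.hbase.client.Result",
#     conf=hbase_conf)
# print(rdd.count())
```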
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
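A hedged sketch of the bullets above in the Spark 1.2-era Python API. The table and column names are hypothetical, and the Spark calls are commented out because they require a Spark build with Hive support and a Hive metastore.

```python
# Plain query string, testable on its own; handed to Spark SQL below.
TOP_PAGES_QUERY = "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page"

# from pyspark import SparkContext
# from pyspark.sql import HiveContext
# sc = SparkContext("local[*]", "hive-demo")
# hc = HiveContext(sc)
# top_pages = hc.sql(TOP_PAGES_QUERY)          # read from a Hive table
# top_pages.saveAsTable("top_pages_snapshot")  # write the result back to Hive
```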
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
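A hedged sketch of the native integration in the Python API (Kafka support in the Python API arrived around Spark 1.3; topic, group and host names are hypothetical). The parser is plain Python and testable without a cluster; the streaming wiring is commented out.

```python
def parse_event(message):
    """Kafka delivers (key, value) pairs; split the CSV value into fields."""
    _key, value = message
    return value.split(",")

# from pyspark import SparkContext
# from pyspark.streaming import StreamingContext
# from pyspark.streaming.kafka import KafkaUtils
# sc = SparkContext("local[2]", "kafka-demo")   # >= 2 threads: receiver + work
# ssc = StreamingContext(sc, 5)                 # 5-second mini batches
# stream = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group",
#                                  {"events": 1})
# stream.map(parse_event).pprint()
# ssc.start(); ssc.awaitTermination()
```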
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
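A sketch of what "no more DDL" means in practice: one JSON document per line, schema inferred on load. The sample data is built with the stdlib and is testable as-is; the Spark SQL calls (Spark 1.2-era API) are commented out because they need a Spark installation.

```python
import json

records = [{"user": "ada", "visits": 3}, {"user": "alan", "visits": 5}]
lines = [json.dumps(r) for r in records]   # one JSON object per line

# from pyspark import SparkContext
# from pyspark.sql import SQLContext
# sc = SparkContext("local[*]", "json-demo")
# sqlContext = SQLContext(sc)
# people = sqlContext.jsonRDD(sc.parallelize(lines))   # schema inferred here
# people.registerTempTable("people")
# print(sqlContext.sql("SELECT user FROM people WHERE visits > 4").collect())
```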
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
  Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your "Big Data" use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25 2014, by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
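The chained-transformation style of the native API can be sketched without Spark at all. Below is a toy, stdlib-only Python illustration of the classic word-count chain; the flatMap/map/reduceByKey names in the comments refer to the real RDD operations, but the code itself is plain Python, not PySpark:

```python
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap(lambda line: line.split()) -- one flat list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map(lambda w: (w, 1)) -- pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(lambda a, b: a + b) -- sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In actual Spark the same pipeline is a single chained expression over an RDD, and the interactive Scala/Python shells let you build it up step by step.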
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
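The first row of the table, record-at-a-time versus mini batches, is the core difference between the two models. Here is a toy sketch in plain Python (neither Storm's nor Spark Streaming's actual API) of how each model would group the same stream of records:

```python
# Toy illustration only -- not Storm or Spark Streaming code.
events = list(range(10))  # a stream of 10 incoming records

# Storm-style: handle each record individually as it arrives
processed_one_at_a_time = [e * 2 for e in events]

# Spark Streaming-style: collect records into mini-batches
# (one batch per "batch interval"), then process each batch as a unit
batch_size = 4
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
processed_in_batches = [[e * 2 for e in batch] for batch in batches]

print(len(batches))             # 3
print(processed_in_batches[0])  # [0, 2, 4, 6]
```

Batching is what lets Spark Streaming reuse the core Spark batch API and give exactly-once semantics per batch, at the cost of the few-seconds latency shown in the table.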
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
36
Hive on Spark (Currently in Beta, expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11 2015, Szehon Ho, Cloudera httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
38
Cascading (Expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
[Matrix of Hadoop ecosystem services and the open source tools Spark integrates with:]
• Storage / Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
44
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
  • GitHub httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
  • httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
  • httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
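The schema-inference idea can be sketched with nothing but the standard library. A rough illustration (not Spark's actual inference code): take the union of fields seen across JSON records, and note the type observed for each field:

```python
import json

# Two records with partially overlapping fields, as JSON text lines
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

# Union of all fields with the value types observed -- roughly what
# Spark SQL does before exposing the data as a SchemaRDD/DataFrame.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

print(sorted(schema))  # ['age', 'city', 'name']
```

Real Spark SQL additionally merges conflicting types, handles nesting and arrays, and maps the result onto its own SQL type system.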
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
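The point of a columnar format is that a query touching one column never has to read the others. A minimal, format-agnostic sketch in plain Python (not the Parquet encoding itself):

```python
# Row-oriented layout: reading one field still walks every full row
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 75},
]

# Column-oriented layout: each column is stored contiguously,
# so scanning 'score' never touches 'id' or 'name'
columns = {
    "id": [1, 2],
    "name": ["alice", "bob"],
    "score": [90, 75],
}

avg_row = sum(r["score"] for r in rows) / len(rows)      # touches whole rows
avg_col = sum(columns["score"]) / len(columns["score"])  # touches one column
print(avg_row == avg_col)  # True
```

Parquet adds per-column compression and encoding on top of this layout, which is why analytical queries over a few columns of a wide table are so much cheaper than with row-oriented files.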
58
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
  • Problem:
    - Various inbound data sets
    - Data layout can change without notice
    - New data sets can be added without notice
  • Result:
    - Leverage Spark to dynamically split the data
    - Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3 httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
  • MapR-FS httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
  • httpssparkapacheorgdocslateststorage-openstack-swifthtml
  • httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS": share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
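Fine-grained sharing is easier to picture with a toy model. The sketch below is plain Python and purely illustrative (it uses none of the real Mesos APIs, and the framework names and task lengths are made up): every idle CPU is re-offered on each tick, so a job with many short tasks soaks up capacity that a long-running job is not using.

```python
# Toy sketch of fine-grained resource sharing (NOT the Mesos API):
# a small cluster re-offers idle CPUs every tick, so a Spark-like job
# with short tasks can use capacity another framework leaves idle.
from collections import deque

def run_cluster(cpus, frameworks):
    """frameworks: dict name -> deque of task durations (in ticks)."""
    running = {}   # cpu_id -> (framework, ticks_left)
    log = []       # (tick, cpu, framework) for every task launch
    tick = 0
    while any(frameworks.values()) or running:
        # Finish work on busy CPUs.
        for cpu, (fw, left) in list(running.items()):
            if left == 1:
                del running[cpu]
            else:
                running[cpu] = (fw, left - 1)
        # Offer every idle CPU (fine-grained: offers are per-CPU and
        # re-made every tick, not reserved per job for its lifetime).
        for cpu in range(cpus):
            if cpu not in running:
                for fw in sorted(frameworks, key=lambda f: -len(frameworks[f])):
                    if frameworks[fw]:
                        running[cpu] = (fw, frameworks[fw].popleft())
                        log.append((tick, cpu, fw))
                        break
        tick += 1
    return log

log = run_cluster(
    4,
    {"mapreduce": deque([3, 3]),   # two long tasks
     "spark": deque([1] * 6)},     # many short tasks
)
print(log)
```

In real Mesos the same effect comes from resource offers: the master offers freed resources to registered frameworks, and Spark's fine-grained mode accepts them per task.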
90
YARN vs Mesos
Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and memory
Running tasks     Unix processes                    Linux container groups
Requests          Specific requests and             More generic, but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
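The conciseness argument is easiest to see on the canonical word count. As a sketch, here is the same flatMap → map → reduceByKey dataflow in plain Python (no Spark installation needed to run it); the pyspark equivalent is shown in the comment.

```python
# Word count expressed as the Spark-style flatMap / map / reduceByKey
# pipeline, but in plain Python so it runs without a cluster.
# In pyspark, the same dataflow is roughly:
#   sc.textFile("hdfs://...").flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from itertools import chain
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]

# flatMap: one line -> many words
words = chain.from_iterable(line.split() for line in lines)
# map: word -> (word, 1)
pairs = ((w, 1) for w in words)
# reduceByKey: sum the 1s per word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))
```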
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
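To make "mix and match SQL with imperative code" concrete without a cluster, here is the pattern in miniature, with the stdlib SQLite module standing in for the SQL engine (an illustration of the idea only, not Spark code; the table and column names are invented):

```python
# The Spark SQL idea in miniature: declarative SQL for the relational
# part, ordinary code for the ad-hoc part. SQLite stands in for the
# SQL engine here; in Spark the table would be a Hive table, JSON or
# Parquet file, and the result an RDD/DataFrame.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("ann", 120), ("bob", 300), ("ann", 80), ("cid", 50)])

# Step 1: SQL for the relational aggregation.
rows = db.execute(
    "SELECT user, SUM(bytes) AS total FROM events GROUP BY user").fetchall()

# Step 2: imperative post-processing that would be awkward in pure SQL.
flagged = [user for user, total in rows if total > 100]

print(sorted(flagged))   # users over quota
```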
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once              Exactly once
record processed)            (may be duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
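The first row of the table ("record at a time" vs "mini batches") is the crux. A plain-Python sketch of the mini-batch model (illustrative only; the event data is made up): incoming events are grouped into fixed-width time windows, and each window is processed as one small batch, which is why latency is a few seconds rather than sub-second.

```python
# Mini-batch (Spark Streaming style) vs record-at-a-time (Storm style),
# sketched in plain Python. Events carry a timestamp in seconds.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (1.8, "c"), (2.4, "a")]

def micro_batches(events, width=1.0):
    """Group events into consecutive fixed-width time windows."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // width), []).append(value)
    return [batch for _, batch in sorted(batches.items())]

# Spark Streaming style: latency ~ batch width; each window becomes
# one small batch job over a tiny RDD.
print(micro_batches(events))

# Storm style: each record is handed to the topology as it arrives.
processed = [value for _, value in events]
```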
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
37
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop 2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
38
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
39
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
40
Mahout (Expected in Mahout 1.0)
• Mahout news, 25 April 2014: 'Goodbye MapReduce.' Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
41
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Service categories with open source tools that integrate with Spark (tool logos shown on the original slide): storage/serving layer, data formats, data ingestion services, resource management, search, SQL.
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark JIRA issues mentioning YARN: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: a Flume-style push-based approach
  • Approach 2 (experimental): a pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
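What "automatically infer the schema" amounts to can be sketched in a few lines of plain Python. This toy version (field names invented; Spark's real inference is richer, handling nested structs and arrays) scans the records, unions their fields, and widens types when records disagree:

```python
# Toy version of Spark SQL's JSON schema inference: scan the records,
# take the union of all fields, and widen types when records disagree.
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
    '{"name": "cid", "age": 28.5}',
]

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            seen = schema.get(field)
            if seen is None or seen == t:
                schema[field] = t
            elif {seen, t} == {"int", "float"}:
                schema[field] = "float"   # widen int -> float
            else:
                schema[field] = "str"     # fall back to string
    return schema

print(infer_schema(records))
```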
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
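The benefit of a columnar format is easy to show in miniature. The plain-Python sketch below (illustrative data; real Parquet adds row groups, encodings and compression) contrasts a row layout with a column layout: an aggregate over one field touches a single column instead of every record.

```python
# Row layout vs columnar layout, in miniature. Parquet stores columns
# contiguously, so "SELECT SUM(bytes)" touches one column, not whole rows.
rows = [
    {"user": "ann", "url": "/a", "bytes": 120},
    {"user": "bob", "url": "/b", "bytes": 300},
    {"user": "ann", "url": "/c", "bytes": 80},
]

# Row-oriented: one record after another (what a CSV or Avro file is like).
row_store = rows

# Column-oriented: one list per field (what Parquet is like, pre-encoding).
col_store = {field: [r[field] for r in rows] for field in rows[0]}

# A scan of a single column reads len(rows) values instead of
# len(rows) * num_fields values:
total = sum(col_store["bytes"])
print(total)
```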
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: 'CrunchIndexerTool on Spark'.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
(Hadoop ecosystem and Spark ecosystem component logos shown on the slide)
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more 'stream oriented', has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the 'Right' Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
38
(Expected in 3.1 release)
bull Cascading httpwwwcascadingorg is an application development platform for building data applications on Hadoop
bull Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source httpscaldingio201410running-scalding-on-apache-spark
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 0.11 ships with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expected in Mahout 1.0)
bull Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
43
3 Integration
Service                    Open Source Tool
Storage
Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL
44
3 Integration bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memory. This allows many Spark applications to share RDDs since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration bull Out of the box Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTestscala from the Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark & Cassandra Integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra httptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connector bull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental) bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
database bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014 httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN: Yet Another Resource Negotiator, an implicit reference to Mesos as the resource negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables bull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0 bull Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration bull Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
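What makes Kafka a natural source for Spark Streaming is its log abstraction: consumers track their own offsets into an append-only log, so a failed job can rewind and replay. A toy plain-Python sketch of that idea (not the Kafka API; class and method names are invented for illustration):

```python
# Illustrative sketch of Kafka-style consumption: an append-only log plus
# consumer-managed offsets, which is what lets a streaming job replay
# records after a failure. Not Kafka code; names are made up.

class Log:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

log = Log()
for word in ["spark", "streaming", "kafka"]:
    log.append(word)

# A consumer remembers how far it has read ...
offset = 0
batch = log.read(offset)
offset += len(batch)                        # commit the new position

# ... and can rewind to reprocess everything after a failure.
replayed = log.read(0)
assert batch == replayed == ["spark", "streaming", "kafka"]
```

The broker keeps no per-consumer state; the consumer's committed offset is the only cursor, which is why replay is cheap.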
55
3 Integration bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume. There are two approaches to this bull Approach 1 Flume-style Push-based Approach bull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
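The push-vs-pull distinction can be pictured without Flume or Spark at all. In the push model the source writes straight into the receiver's buffer; in the pull model events land in an intermediate sink that the consumer drains at its own pace. A hypothetical plain-Python sketch (not the Flume API):

```python
from collections import deque

# Approach 1 (push): the source delivers events straight to the receiver.
def push(source_events, receiver_buffer):
    for e in source_events:
        receiver_buffer.append(e)

# Approach 2 (pull): events accumulate in an intermediate sink; the
# consumer drains it when ready, which tolerates a slow or restarting
# consumer -- the robustness argument for the custom-sink approach.
def pull(sink, n):
    return [sink.popleft() for _ in range(min(n, len(sink)))]

receiver = deque()
push(["e1", "e2"], receiver)
assert list(receiver) == ["e1", "e2"]

sink = deque(["e3", "e4", "e5"])
assert pull(sink, 2) == ["e3", "e4"]   # consumer takes two events ...
assert list(sink) == ["e5"]            # ... the rest wait in the sink
```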
56
3 Integration bull Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
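What "automatically infer the schema" means can be sketched in a few lines of plain Python — a toy version of the kind of inference Spark SQL performs, not its actual code: scan the records, take the union of the field names, and note a type per field.

```python
import json

# Toy schema inference in the spirit of Spark SQL's JSON support:
# union all fields seen across records and record each field's type.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "NYC"}',
]
schema = infer_schema(lines)
# Fields missing from some records still appear in the unified schema.
assert schema == {"name": "str", "age": "int", "city": "str"}
```

The real engine additionally widens conflicting types and handles nesting, but the principle — schema derived from data, no DDL — is the same.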
57
3 Integration bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem regardless of the choice of data processing framework, data model or programming language httpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to bull Import relational data from Parquet files bull Run SQL queries over imported data bull Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
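Why a columnar format pairs well with analytical SQL can be shown with a toy example — plain Python mimicking the layout idea, not the Parquet format itself: storing by column lets a query touch only the columns it needs.

```python
# Toy row-store vs column-store comparison, illustrating why columnar
# formats such as Parquet suit analytical queries: scanning one column
# never touches the other columns' data.
rows = [("alice", 34, "NYC"), ("bob", 41, "LA"), ("carol", 29, "NYC")]

# Column layout: one array per column, as a columnar file would store it.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# "SELECT avg(age)" reads a single contiguous column ...
avg_age = sum(columns["age"]) / len(columns["age"])
assert round(avg_age, 2) == 34.67

# ... whereas the row layout drags every field of every row past the CPU.
avg_age_rowwise = sum(age for _, age, _ in rows) / len(rows)
assert avg_age == avg_age_rowwise
```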
58
3 Integration bull Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
bull Avro/Spark Use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
bull Spark support has been added to the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
bull Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
64
4 Complementarity Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos bull 'Myriad' Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Can't We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching)
bull The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data >> RAM When processing huge data volumes much bigger than cluster RAM, Tez might be better since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration
bull Data << RAM Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
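The Data << RAM point is really an argument about avoiding recomputation. A toy plain-Python illustration (no Spark involved) that counts how often an expensive "parse" step runs with and without a cache, mimicking the effect of `rdd.cache()`:

```python
# Toy illustration of why caching pays off when data fits in memory:
# count how many times the expensive "parse" runs with and without a cache.
parse_calls = 0

def parse(raw):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching, every action re-parses the input
# (like an uncached RDD recomputed per action).
total = sum(parse(raw_data))
maximum = max(parse(raw_data))
assert (total, maximum, parse_calls) == (10, 4, 2)

# With caching, the parsed result is computed once and reused --
# the effect of rdd.cache() when the data fits in cluster memory.
cached = parse(raw_data)          # third and final parse
assert (sum(cached), max(cached)) == (10, 4)
assert parse_calls == 3
```

When the data does not fit, the cache is spilled or recomputed anyway, which is the Data >> RAM caveat above.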
70
4 Complementarity bull Emergence of the 'Smart Execution Engine' Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015 httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015 httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity
3 Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
73
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file system already supported by Spark
bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FS
bull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
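The "bring your own storage" point boils down to Spark addressing datasets by URI and delegating I/O to whichever connector handles the URI's scheme. A hypothetical plain-Python sketch of that dispatch idea (the connector table is invented for illustration; it is not Spark code):

```python
from urllib.parse import urlparse

# Toy dispatch table standing in for Spark's pluggable storage layer:
# the path's URI scheme, not Spark itself, decides which storage is used.
CONNECTORS = {
    "hdfs":    "Hadoop Distributed File System",
    "s3n":     "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file":    "local file system",
}

def resolve_storage(path):
    scheme = urlparse(path).scheme or "file"   # bare paths -> local files
    return CONNECTORS[scheme]

assert resolve_storage("hdfs://namenode:8020/logs") == "Hadoop Distributed File System"
assert resolve_storage("s3n://bucket/logs") == "Amazon S3"
assert resolve_storage("tachyon://master:19998/logs") == "Tachyon in-memory file system"
assert resolve_storage("/tmp/logs") == "local file system"
```

Swapping storage then means changing a path string, not the processing code — which is why the alternatives listed above are drop-in choices.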
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include
bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull ...
76
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) - httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
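All of these choices surface in one place at launch time: the master URL handed to Spark. The common Spark 1.x forms can be listed as plain data — host names and ports below are placeholders, not real endpoints:

```python
# Master URL forms accepted by Spark 1.x for the main deployment modes.
# Host names here are placeholders; 7077 and 5050 are the usual default
# ports for the standalone master and the Mesos master respectively.
def master_url(mode, host="cluster-host"):
    urls = {
        "local":      "local[*]",                 # all cores on one machine
        "standalone": "spark://%s:7077" % host,   # Spark's own cluster manager
        "mesos":      "mesos://%s:5050" % host,   # Apache Mesos
        "yarn":       "yarn-cluster",             # Hadoop YARN, cluster mode
    }
    return urls[mode]

assert master_url("local") == "local[*]"
assert master_url("standalone", "master1") == "spark://master1:7077"
assert master_url("mesos", "master1") == "mesos://master1:5050"
assert master_url("yarn") == "yarn-cluster"
```

The application code stays the same across all four; only the URL passed to spark-submit (or the SparkConf) changes.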
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions bull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015 httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE bull DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark & Cassandra Piotr Kolaczkowski September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector Helena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives
                 Hadoop Ecosystem    Spark Ecosystem
Component        HDFS                Tachyon
                 YARN                Mesos
Tools            Pig                 Spark native API
                 Hive                Spark SQL
                 Mahout              MLlib
                 Storm               Spark Streaming
                 Giraph              GraphX
                 HUE                 Spark Notebook, ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
bull Mesos as Data Center "OS" bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS ...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and Memory
Running tasks     Unix processes                              Linux Container groups
Requests          Specific requests and locality preference   More generic but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API bull Spark Native API in Scala, Java and Python bull Interactive shell in Scala and Python bull Spark supports Java 8 lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
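The flavor of the native API — a chain of flatMap / map / reduceByKey-style transformations — can be mimicked in ordinary Python without a cluster. This is a sketch of the programming model only, not of Spark's distributed execution:

```python
from collections import Counter
from itertools import chain

# Word count written in the shape of the Spark RDD API
# (flatMap -> map -> reduceByKey), but over plain Python lists.
lines = ["spark and hadoop", "spark or hadoop"]

words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = ((w, 1) for w in words)                               # map to (key, 1)
counts = Counter()                                            # reduceByKey(+)
for word, one in pairs:
    counts[word] += one

assert counts == {"spark": 2, "hadoop": 2, "and": 1, "or": 1}
```

In Scala or Java 8 the same pipeline is a handful of lambda expressions, which is the conciseness point the slide makes.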
92
Spark SQL bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
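The "mix and match SQL and imperative APIs" idea is easy to demonstrate outside Spark. The sketch below uses Python's built-in sqlite3 in place of Spark SQL, purely to show a declarative step and a procedural step interleaving in one program:

```python
import sqlite3

# Declarative step: SQL selects and aggregates ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 5), ("alice", 4)])
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... imperative step: arbitrary code post-processes the result set,
# the way a Spark program follows a SQL query with RDD transformations.
top_users = [user for user, total in rows if total > 4]

assert rows == [("alice", 7), ("bob", 5)]
assert top_users == ["alice", "bob"]
```

Spark SQL's contribution is doing this over distributed data with one engine, so the SQL results and the programmatic transformations share the same in-memory datasets.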
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                               Spark Streaming
Processing model             Record at a time                    Mini batches
Latency                      Sub-second                          Few seconds
Fault tolerance              At least once (may be duplicates)   Exactly once
Batch framework integration  Not available                       Core Spark API
Supported languages          Any programming language            Scala, Java, Python
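The first two rows of the table describe one design choice. A plain-Python sketch of the two processing models (no Storm or Spark involved; timestamps are invented for illustration):

```python
# Per-record vs micro-batch processing, the core contrast in the table:
# Storm handles each record as it arrives (low latency), while Spark
# Streaming groups records arriving in the same interval into a small
# batch and processes the batch at once (latency ~ the batch interval).
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

# Record-at-a-time: one processing call per event.
processed = [value for _, value in events]
assert processed == ["a", "b", "c", "d", "e"]

# Mini-batches with a 1-second interval: events are keyed by which
# interval their timestamp falls into, then handled a batch at a time.
batches = {}
for timestamp, value in events:
    batches.setdefault(int(timestamp), []).append(value)

assert batches == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```

Batching is also what buys Spark Streaming its exactly-once story: a whole batch can be recomputed deterministically from its inputs, instead of tracking acknowledgements per record.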
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
100
V More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
39
Apache Crunchbull The Apache Crunch Java library provides a framework for writing testing and running MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a SparkPipeline class making it easy to migrate data processing applications from MapReduce to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
40
(Expec (Expected in Mahout 10 )
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration: Service / Open Source Tool
Storage / Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) is planned, to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
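Because these backends are all reached through the Hadoop storage API, switching storage usually means changing only the URI scheme. A minimal Spark 1.x-era Scala sketch (hostnames and paths are placeholders, and `sc` is assumed to be an existing SparkContext):

```scala
// Same textFile API, different storage backends; only the URI scheme changes.
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/events.log")
val fromLocal = sc.textFile("file:///tmp/events.log")
val fromS3    = sc.textFile("s3n://my-bucket/logs/events.log") // needs AWS credentials configured

println(fromHdfs.count()) // actions run identically regardless of the backend
```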
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
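The newAPIHadoopRDD route mentioned above can be sketched as follows, modeled on the HBaseTest.scala example (the table name is a placeholder; the HBase client jars and a running cluster are assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

// Expose the HBase table as an RDD of (row key, row) pairs
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRDD.count())
```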
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
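With the DataStax connector on the classpath, reading and writing Cassandra tables reduces to a couple of calls. A sketch (keyspace, table, and column names are placeholders):

```scala
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD (keyspace and table are placeholders)
val words = sc.cassandraTable("test_ks", "words")
println(words.count())

// Write an RDD back to Cassandra, mapping tuple fields to columns
sc.parallelize(Seq(("spark", 1), ("cassandra", 2)))
  .saveToCassandra("test_ks", "words", SomeColumns("word", "count"))
```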
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
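Going through the Mongo-Hadoop connector follows the same newAPIHadoopRDD pattern as HBase. A sketch (the connection URI is a placeholder; the mongo-hadoop jars are assumed to be on the classpath):

```scala
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection") // placeholder URI

// Expose the collection as an RDD of (ObjectId, BSON document) pairs
val documents = sc.newAPIHadoopRDD(mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

println(documents.count())
```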
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
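The import/query/write-back flow above looks roughly like this with Spark 1.2's HiveContext, assuming a Hive metastore is reachable and the table names are placeholders:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Query an existing Hive table (name is a placeholder)
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)

// Results are ordinary SchemaRDDs, so they can feed MLlib or be written back
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")
```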
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
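The receiver-based API from the integration guide can be sketched like this (ZooKeeper quorum, consumer group, and topic map are placeholders; the spark-streaming-kafka artifact is assumed on the classpath):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10)) // 10-second mini-batches

// Map of (topic -> number of receiver threads); all values are placeholders
val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

messages.map(_._2).count().print() // count of message payloads per batch

ssc.start()
ssc.awaitTermination()
```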
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
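Schema inference in practice, as a Spark 1.2-era sketch (the path and field names are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point Spark SQL at JSON; the schema is inferred, no DDL needed
val people = sqlContext.jsonFile("hdfs:///data/people.json") // placeholder path
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```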
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
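The Parquet read/write round trip, in Spark 1.2-era API (paths are placeholders):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Any SchemaRDD can be written out as Parquet
val people = sqlContext.jsonFile("people.json")   // placeholder input
people.saveAsParquetFile("people.parquet")        // columnar output

// Read it back; the schema is preserved in the Parquet metadata
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect().foreach(println)
```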
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
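elasticsearch-hadoop's native RDD integration can be sketched like this (the index/type and document contents are placeholders; `es.nodes` must point at a reachable cluster):

```scala
import org.elasticsearch.spark._

// Write: any RDD whose elements translate to documents can be indexed
sc.makeRDD(Seq(Map("title" -> "Spark on ES", "year" -> 2015)))
  .saveToEs("talks/slides") // placeholder index/type

// Read: the index comes back as an RDD of (document id, document) pairs
val docs = sc.esRDD("talks/slides")
println(docs.count())
```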
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
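Because the cluster manager is selected by the master URL alone, the application code is identical across these deployments. A sketch (hosts and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick the clustering layer with the master URL; nothing else changes.
val master = "local[*]"                // local mode, all cores
// val master = "spark://host:7077"    // standalone cluster
// val master = "mesos://host:5050"    // Apache Mesos
// val master = "yarn-client"          // YARN, when Hadoop is present

val conf = new SparkConf().setMaster(master).setAppName("deployment-agnostic-app")
val sc = new SparkContext(conf)
```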
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem | Spark ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
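In Spark 1.x, the OFF_HEAP storage level stores RDD blocks in Tachyon rather than on the JVM heap. A sketch (assumes spark.tachyonStore.url points at a running Tachyon master; the path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs:///data/events.log") // placeholder path

// Blocks live in Tachyon, outside the executor JVMs, so cached data
// survives executor crashes and can be shared across applications
events.persist(StorageLevel.OFF_HEAP)
println(events.count())
```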
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
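The mini-batch model from the comparison above means each batch is just an RDD, processed with the core Spark API. A minimal sketch (host and port are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5)) // 5-second mini-batches

// Word count over a socket stream, written with ordinary RDD operations
ssc.socketTextStream("localhost", 9999) // placeholder source
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```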
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
40
(Expec (Expected in Mahout 10 )
bull Mahout News 25 April 2014 - Goodbye MapReduce Apache Mahout the original Machine Learning (ML) library for Hadoop since 2009 is rejecting new MapReduce algorithm implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for Spark optimized Mahout DSL httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
41
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov April 2014httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with Mahout Scala and Spark Published on May 30 2014httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)- MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
42
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
43
3 IntegrationService Open Source Tool
StorageServing Layer
Data Formats
Data Ingestion ServicesResource Management
Search
SQL
44
3 Integrationbull Spark was designed to read and write data from and to HDFS as
well as other storage systems supported by Hadoop API such as your local file system Hive HBase Cassandra and Amazonrsquos S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memory httphortonworkscomblogddm to store RDDs in memoryThis allows many Spark applications to share RDDs since they are now resident outside the address space of the application Related HDFS-5851 is planned for Hadoop 30 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integrationbull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status Still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some open issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
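A minimal sketch of the Hive support described above, as it looks in the Spark 1.2 shell (the table name `src` is a hypothetical existing Hive table):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up hive-site.xml from the classpath and
// gives access to the Hive metastore, data formats, and UDFs
val hiveCtx = new HiveContext(sc) // sc: the shell's SparkContext

// Query an existing Hive table with plain SQL
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)

// Write results back out as a new Hive table
hiveCtx.sql("CREATE TABLE IF NOT EXISTS src_copy AS SELECT * FROM src")
```

The result of `sql(...)` is a SchemaRDD, so it can also be processed with the normal RDD API before being written out.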
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
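The native Kafka integration above can be sketched as follows (Spark 1.2-era receiver-based API; the ZooKeeper quorum, consumer group, and topic are hypothetical):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Build a streaming context with 2-second micro-batches
val ssc = new StreamingContext(sc, Seconds(2)) // sc: existing SparkContext

// createStream returns a DStream of (key, message) pairs;
// the Map gives the number of receiver threads per topic
val messages = KafkaUtils
  .createStream(ssc, "zkhost:2181", "demo-group", Map("events" -> 1))
  .map(_._2)

// Classic streaming word count over each micro-batch
messages.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```

This requires the spark-streaming-kafka artifact on the classpath; later Spark releases added a direct (receiver-less) approach as well.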
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
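A sketch of the push-based approach (Approach 1): Spark Streaming starts an Avro receiver that a Flume agent is configured to push events to. Hostname and port are placeholders, and `ssc` is an existing StreamingContext:

```scala
import org.apache.spark.streaming.flume.FlumeUtils

// Spark Streaming listens as an Avro sink; the Flume agent's
// avro sink must point at this host:port
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 9999)

// Each element is a SparkFlumeEvent wrapping the Avro event body
flumeStream
  .map(ev => new String(ev.event.getBody.array()))
  .print()
```

The pull-based approach instead deploys a custom Spark sink jar into the Flume agent and uses `FlumeUtils.createPollingStream`, trading setup complexity for stronger reliability.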
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
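The schema-inference flow above looks like this in the Spark 1.2 shell (the file `people.json`, with one JSON object per line, is a hypothetical input):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: the shell's SparkContext

// Infer the schema directly from the JSON records - no DDL needed
val people = sqlContext.jsonFile("people.json")
people.printSchema() // shows the inferred nested schema

// Register the SchemaRDD as a table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect()
  .foreach(println)
```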
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
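Round-tripping a SchemaRDD through Parquet, per the built-in support listed above (Spark 1.2 API; paths are hypothetical, and `people` is any SchemaRDD, e.g. one loaded from JSON):

```scala
// Write the SchemaRDD out in columnar Parquet format,
// preserving its schema
people.saveAsParquetFile("people.parquet")

// Read it back; the schema travels with the data
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT COUNT(*) FROM parquet_people")
  .collect()
  .foreach(println)
```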
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
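A sketch of querying Avro data through the spark-avro library mentioned above. This assumes the early spark-avro API for Spark 1.2, where `avroFile` is added to SQLContext via an implicit import; the file name is hypothetical:

```scala
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)

// Load an Avro file as a SchemaRDD; the Avro schema is mapped
// to a Spark SQL schema automatically
val episodes = sqlContext.avroFile("episodes.avro")
episodes.registerTempTable("episodes")

sqlContext.sql("SELECT title FROM episodes LIMIT 5")
  .collect()
  .foreach(println)
```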
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
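The elasticsearch-hadoop native integration described above works roughly like this (the index/type name and node address are hypothetical; requires the elasticsearch-hadoop jar on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / esRDD

val conf = new SparkConf()
  .setAppName("EsSketch")
  .set("es.nodes", "localhost") // hypothetical ES node
val sc = new SparkContext(conf)

// Any RDD whose elements can be turned into documents can be indexed
val docs = Seq(Map("title" -> "Spark"), Map("title" -> "Hadoop"))
sc.makeRDD(docs).saveToEs("library/books") // "index/type"

// Read the index back as an RDD of (documentId, fieldMap) pairs
val esRdd = sc.esRDD("library/books")
esRdd.take(2).foreach(println)
```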
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
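For example, reading directly from Amazon S3 is just a matter of the URI scheme; Spark delegates to the Hadoop S3 filesystem client, no HDFS involved (bucket and path below are hypothetical):

```scala
// Credentials go into the Hadoop configuration; here they are
// pulled from environment variables for illustration
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
  sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
  sys.env("AWS_SECRET_ACCESS_KEY"))

// Read log files straight out of S3, exactly like an HDFS path
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.count())
```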
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

              Hadoop ecosystem   Spark ecosystem
Components:   HDFS               Tachyon
              YARN               Mesos
Tools:        Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
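Because Tachyon speaks the Hadoop FileSystem API, using it from Spark is again just a URI change (master host/port below are hypothetical defaults):

```scala
// Read and write through Tachyon instead of hdfs://
val rdd = sc.textFile("tachyon://tachyon-master:19998/data/input.txt")
rdd.saveAsTextFile("tachyon://tachyon-master:19998/data/output")

// In Spark 1.x, RDDs can also be persisted off-heap,
// which is backed by Tachyon
import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.OFF_HEAP)
```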
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
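The conciseness of the native API is easiest to see in the canonical word count, here as typed into the Scala shell (input path hypothetical):

```scala
val lines = sc.textFile("input.txt")

// flatMap/map/reduceByKey chain: each step is a lambda
val counts = lines
  .flatMap(_.split("\\s+")) // split lines into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum counts per word

counts.take(10).foreach(println)
```

The equivalent Java 8 version with lambda expressions is nearly line-for-line the same, which is the point the slide makes.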
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance              At least once               Exactly once
(every record processed)     (may be duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
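From the application's point of view, these deployment modes differ mainly in the --master URL handed to spark-submit. A few illustrative invocations; host names, ports, and app.py are placeholders, not real endpoints:

```shell
# 1. Local mode: n worker threads on one machine, no cluster at all.
spark-submit --master "local[4]" app.py

# 2. Spark standalone cluster.
spark-submit --master spark://master-host:7077 app.py

# 3. Apache Mesos.
spark-submit --master mesos://mesos-master:5050 app.py

# 4. Hadoop YARN, when a Hadoop cluster is available
#    (Spark 1.x syntax; later versions use --master yarn --deploy-mode cluster).
spark-submit --master yarn-cluster app.py
```

The application code itself does not change between these modes.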
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem | Spark ecosystem
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
42
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
43
3. Integration
Services from the Hadoop ecosystem integrated with Spark, each paired with an open source tool:
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways
1 Evolution of compute models is still ongoing Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at One size doesn't fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File System
Spark does not require HDFS (Hadoop Distributed File System) Your 'Big Data' use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
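Spark addresses all of these storage systems through Hadoop-style URIs, so switching file systems is mostly a matter of the path's scheme. A minimal plain-Python sketch of that dispatch idea (the scheme-to-backend table below is illustrative; Spark itself resolves schemes through the Hadoop FileSystem API, not a dict like this):

```python
from urllib.parse import urlparse

# Illustrative mapping of URI schemes to storage backends
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "maprfs": "MapR-FS",
    "tachyon": "Tachyon",
    "swift": "OpenStack Swift",
    "file": "Local file system",
}

def backend_for(path):
    """Pick a storage backend from the URI scheme; schemeless
    paths fall back to the local file system."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

backend_for("s3n://bucket/logs")              # 'Amazon S3'
backend_for("tachyon://master:19998/data")    # 'Tachyon'
backend_for("/tmp/data")                      # 'Local file system'
```

This is why "bring your own storage" works: application code passes a different URI and the engine does the rest.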
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives Because Hadoop isn't perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazon's S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
bull DSE (DataStax Enterprise) built on Apache Cassandra presents itself as a Non-Hadoop Big Data Platform Data can be stored in the Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark & Cassandra Piotr Kolaczkowski September 26 2014 httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector Helena Edelson November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100% open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
bull 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull 'xPatterns' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull 'BlueData' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull The Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull 'Guavus' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
Component
HDFS | Tachyon
YARN | Mesos
Tools
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine-grained sharing which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution This leads to considerable performance improvements especially for long-running Spark jobs
bull Mesos as Data Center "OS"
bull Share the datacenter between multiple cluster computing apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including Apache Spark Apache Cassandra Apache YARN Apache HDFS...
bull 'Mesos' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8, whose much more concise lambda expressions make code nearly as simple as with the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014 httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull 'Spark Core' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag11-core-spark
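As a flavor of the RDD-style API, here is the canonical word count expressed with plain Python lambdas and functional building blocks (no SparkContext involved; in real Spark the same shape appears as textFile → flatMap → map → reduceByKey):

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = map(lambda w: (w, 1), words)

# reduceByKey: sum the counts per word
def reduce_by_key(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

counts = reduce(reduce_by_key, pairs, {})
# counts == {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

The concise lambda style is exactly what Java 8 brought to the Java API, closing much of the gap with Scala.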
92
Spark SQL
bull Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDFs) and the Hive metastore
bull Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala Java Python
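The processing-model row is the key difference; a plain-Python sketch of it (not Storm or Spark code): record-at-a-time processing is just mini-batching with a batch size of one:

```python
def to_mini_batches(records, batch_size):
    """Group an incoming stream into fixed-size mini batches, the way
    Spark Streaming groups records per batch interval (simplified: we
    batch by count here, not by wall-clock time)."""
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]

events = ["e1", "e2", "e3", "e4", "e5"]
spark_style = to_mini_batches(events, 2)  # [['e1','e2'], ['e3','e4'], ['e5']]
storm_style = to_mini_batches(events, 1)  # one record per "batch"
```

Batching amortizes scheduling overhead (hence Spark Streaming's throughput and exactly-once semantics) at the cost of the few seconds of latency shown in the table.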
96
GraphX
'GraphX' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
bull Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics It has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
5 Key Takeaways
1 File System Spark is file-system agnostic Bring your own storage
2 Deployment Spark is cluster-infrastructure agnostic Choose your deployment
3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as a service in the cloud or embedded in Non-Hadoop distributions are emerging
4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
43
3 Integration
Service | Open Source Tool
bull Storage/Serving Layer
bull Data Formats
bull Data Ingestion Services
bull Resource Management
bull Search
bull SQL
44
3 Integration
bull Spark was designed to read and write data from and to HDFS as well as other storage systems supported by the Hadoop API such as your local file system Hive HBase Cassandra and Amazon's S3
bull Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cache httpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM (Discardable Distributed Memory) httphortonworkscomblogddm to store RDDs in memory This allows many Spark applications to share RDDs since they are now resident outside the address space of the application The related HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
45
3 Integration
bull Out of the box Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD Example HBaseTestscala from the Spark code base httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with Spark Status still in experimentation and no timetable for possible support httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3 Integration
bull Spark Cassandra Connector This library lets you expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications It also supports integration of Spark Streaming with Cassandra httpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandra httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull 'Cassandra' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration
bull Benchmark of Spark & Cassandra integration using different approaches httpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
bull A Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration
bull MongoDB is not directly supported by Spark although it can be used from Spark via the official Mongo-Hadoop connector
bull MongoDB-Spark Demo httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insights httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text files httpsgithubcommongodbmongo-hadoop
49
3 Integration
bull There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop & Spark
bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example PART 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on using Spark with MongoDB without Hadoop httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph database
bull Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN
bull YARN (Yet Another Resource Negotiator) an implicit reference to Mesos as the resource negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARN httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integration
bull Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 0.13 is supported in Spark 1.2.0
bull Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
bull Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark integration is work in progress in 2015 to address new use cases
bull Use a Drill query (or view) as the input to Spark Drill extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source What's Coming in 2015 for Drill httpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integration
bull Apache Kafka is a high-throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull 'Kafka' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integration
bull Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guide httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integration
bull Spark SQL provides built-in support for JSON which vastly simplifies the end-to-end experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL just point Spark SQL at JSON files and query Starting with Spark 1.3 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
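Schema inference of the kind described above can be illustrated in plain Python with the standard json module (a toy version of what Spark SQL does when scanning records and unioning their fields; Spark infers richer SQL types and merges type conflicts):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: scan newline-delimited JSON records and
    record the Python type name seen for each field. Keeps the
    first-seen type; real Spark SQL reconciles conflicting types."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "alice", "age": 34}',
           '{"name": "bob", "city": "LA"}']
# infer_schema(records) == {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the inferred schema is the union of fields across records, which is why no up-front DDL is needed.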
57
3 Integration
bull Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming language httpparquetincubatorapacheorg
bull Built-in support in Spark SQL allows you to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull An illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
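Why a columnar format helps: a query touching one field only has to read that column. A minimal plain-Python contrast of row vs columnar layout (illustrative only; Parquet adds encodings, compression and row groups on top of this idea):

```python
# Row layout: each record stored together (like a CSV or Avro file)
rows = [
    {"name": "alice", "age": 34, "city": "LA"},
    {"name": "bob",   "age": 29, "city": "SF"},
]

# Columnar layout: each field stored contiguously (like Parquet)
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A scan of just 'age' touches one list instead of every full record:
ages = columns["age"]            # [34, 29]
avg_age = sum(ages) / len(ages)  # 31.5
```

This column pruning is what makes analytical SQL over Parquet fast regardless of how wide the records are.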
58
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
bull An example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
bull Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
Problem
bull Various inbound data sets
bull Data layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDK
bull The Kite SDK provides high-level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etc httpkitesdkorgdocscurrent
bull Spark support has been added in the Kite 0.16 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demo httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration
bull Elasticsearch is a real-time distributed search and analytics engine httpwwwelasticsearchorg
bull Apache Spark support in Elasticsearch was added in 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark also provides an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving of searchable complex data "CrunchIndexerTool on Spark"
bull A Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Spark httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoop httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015 httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running services
bull Project Myriad is an open source framework for running YARN on Mesos
bull 'Myriad' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
             Hadoop Ecosystem   Spark Ecosystem
Component    HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). httpsamplabcsberkeleyedusoftware
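The memory-centric idea can be sketched in a few lines of plain Python: a RAM-backed store that several "frameworks" share without touching disk. This is a conceptual toy, not Tachyon's actual API.

```python
# Toy in-memory "file system": files live in RAM and are shared across
# computing frameworks. Conceptual sketch only, not Tachyon's real API.

class MemoryFS:
    def __init__(self):
        self._files = {}          # path -> bytes, held in RAM

    def write(self, path, data: bytes):
        self._files[path] = data  # no disk I/O involved

    def read(self, path) -> bytes:
        return self._files[path]  # memory-speed access for any client

fs = MemoryFS()                   # shared by a "Spark" job and a "MapReduce" job
fs.write("/warehouse/events", b"e1,e2,e3")
assert fs.read("/warehouse/events") == b"e1,e2,e3"
```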
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs. Mesos
Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and memory
Running tasks      Unix processes                              Linux container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
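As a rough illustration of the concise, chained style the native API encourages, here is a word count written with plain Python functional tools; a pyspark equivalent would follow the same flatMap → map → reduceByKey shape. This is ordinary Python standing in for the Spark API, not Spark itself.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# flatMap -> map -> reduceByKey, expressed with plain Python:
words = chain.from_iterable(line.split() for line in lines)   # flatMap
counts = Counter(words)                                        # map + reduceByKey

assert counts["to"] == 4 and counts["be"] == 2
```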
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
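The "mix and match SQL with imperative code" idea can be sketched with the stdlib sqlite3 module standing in for Spark SQL; the table name and data below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user").fetchall()

# ... then an imperative step over the result, much as one would keep
# chaining RDD/DataFrame operations in Spark after a SQL query.
top = [user for user, total in rows if total > 6]
assert top == ["ann"]
```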
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance – every record processed   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
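The "record at a time" vs. "mini batches" distinction above can be sketched in plain Python; this is a conceptual simulation, not either framework's API.

```python
import itertools

# Record-at-a-time (Storm-style): handle each event as it arrives.
def per_record(source, handle):
    for record in source:
        handle(record)

# Mini-batches (Spark Streaming-style): cut the stream into small
# batches and run a batch computation on each one.
def mini_batches(source, batch_size):
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch

seen = []
per_record(iter(range(3)), seen.append)            # events handled one by one
batches = list(mini_batches(iter(range(10)), 4))   # events handled batch by batch

assert seen == [0, 1, 2]
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```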
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython. httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
44
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory cached data. httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. httpsissuesapacheorgjirabrowseHDFS-5851
45
3. Integration
• Out of the box, Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector, httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. httptuplejumpgithubiocalliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3. Integration
• MongoDB is not directly supported by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
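Schema inference over JSON records can be sketched in plain Python. Spark SQL does a far more complete version of this (nested types, type widening), but the core idea is just scanning records and unioning the observed field types; the records below are made up.

```python
import json

records = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 2, "city": "LA"}',
]

# Infer a flat schema: field name -> set of observed JSON value types.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

assert schema == {"name": {"str"}, "age": {"int"}, "city": {"str"}}
```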
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files. httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
58
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets.
• Data layout can change without notice.
• New data sets can be added without notice.
• Result:
• Leverage Spark to dynamically split the data.
• Leverage Avro to store the data in a compact binary format.
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group.
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
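Spark's storage agnosticism surfaces in the path prefix: the URI scheme (hdfs://, s3n://, file://, tachyon://, swift://) selects the storage backend. A tiny pure-Python sketch of that dispatch idea; the scheme names are real, but the mapping code is illustrative, not Spark's actual resolution mechanism.

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to a storage backend name;
# Spark resolves schemes like these through the Hadoop FileSystem API.
BACKENDS = {"hdfs": "HDFS", "s3n": "Amazon S3", "file": "local FS",
            "tachyon": "Tachyon", "swift": "OpenStack Swift"}

def backend_for(path: str) -> str:
    scheme = urlparse(path).scheme or "file"   # bare paths default to local
    return BACKENDS[scheme]

assert backend_for("s3n://bucket/logs") == "Amazon S3"
assert backend_for("hdfs://nn:8020/data") == "HDFS"
assert backend_for("/tmp/local.txt") == "local FS"
```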
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic: Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic: choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
45
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
46
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark YARN issues in JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for the machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, and embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query them. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
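A stdlib-only sketch of what "automatically infer the schema" means in practice: scan the JSON records, union the field names, and note each field's type. Spark SQL's actual inference is much richer (type merging across conflicting records, nested structures, arrays), so treat this as a conceptual illustration only.

```python
import json

# Minimal schema inference over a JSON-lines dataset: union the field
# names across records and record the Python type of the first value
# seen for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

data = [
    '{"name": "spark", "stars": 4000}',
    '{"name": "hadoop", "stars": 6000, "tag": "batch"}',
]
schema = infer_schema(data)
print(schema)  # {'name': 'str', 'stars': 'int', 'tag': 'str'}
```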
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
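Why a columnar format matters for the SQL workloads above can be shown with a toy pure-Python layout comparison: a column store keeps each column contiguous, so a query touching one column scans only that column, and values of a single type sit together and compress well. Parquet's real on-disk format adds row groups, encodings and compression on top of this basic idea.

```python
rows = [
    {"id": 1, "lang": "scala",  "stars": 10},
    {"id": 2, "lang": "python", "stars": 20},
    {"id": 3, "lang": "java",   "stars": 30},
]

# Row layout: records stored one after another (like a CSV or Avro file).
row_store = [tuple(r.values()) for r in rows]

# Column layout: one contiguous list per column (the Parquet idea).
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# A query like "SELECT sum(stars)" only needs to scan a single column.
print(sum(column_store["stars"]))  # 60
```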
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines can:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than having to choose one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
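The Data << RAM point is mostly about avoiding repeated parsing and re-reading of the input. Here is a stdlib-only sketch of the effect of caching a parsed dataset, conceptually what rdd.cache() buys you in Spark; the call counter just makes the saved work visible.

```python
import json

RAW = ['{"v": %d}' % i for i in range(5)]
parse_calls = 0

def parse_all():
    # Stands in for the expensive parse/deserialize step that would
    # otherwise be redone on every pass over the data.
    global parse_calls
    parse_calls += 1
    return [json.loads(line)["v"] for line in RAW]

# Without caching: every query re-parses the input, the way repeated
# batch passes re-read from disk.
total = sum(parse_all())
maximum = max(parse_all())
assert parse_calls == 2

# With caching: parse once, then reuse the in-memory parsed form for
# every subsequent query. This wins as long as the data fits in RAM.
parse_calls = 0
cached = parse_all()          # materialized once
total, maximum = sum(cached), max(cached)
print(parse_calls)  # 1
```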
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its operational intelligence platform: Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
          | Hadoop ecosystem | Spark ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a data center between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria         | YARN                                       | Mesos
Resource sharing | Yes                                        | Yes
Written in       | Java                                       | C++
Scheduling       | Memory only                                | CPU and memory
Running tasks    | Unix processes                             | Linux container groups
Requests         | Specific requests and locality preference  | More generic, but more coding for writing frameworks
Maturity         | Less mature                                | Relatively more mature
46
3 Integration bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark RDDs to Cassandra tables and execute arbitrary CQL queries in your Spark applications Supports also integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration is not based on the Cassandras Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag20-cassandra
47
3 Integration bull Benchmark of Spark amp Cassandra Integration using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume data from Cassandra to spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new avenues
bull Kindling An Introduction to Spark with Cassandra (Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
48
3 Integration bull MongoDB is not directly served by Spark although it can be used from Spark via an official Mongo-Hadoop connectorbull MongoDB-Spark Demo
httpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
bull Spark SQL also provides indirect support via its support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
49
3 Integration bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from Apache Spark (still experimental)bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introdu
ction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
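The schema-inference point above can be sketched in a few lines with the Spark 1.2 `SQLContext.jsonFile` API (the file name and field names are hypothetical; an existing `SparkContext` named `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

// people.json holds one JSON object per line, e.g. {"name":"Ann","age":31}
val sqlCtx = new SQLContext(sc)

val people = sqlCtx.jsonFile("people.json") // schema inferred automatically - no DDL
people.printSchema()                        // shows the inferred fields and types

people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```

In Spark 1.3+ the same idea is expressed through the DataFrame reader API, as noted in the slide.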
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
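The three built-in capabilities listed above map directly onto the Spark 1.2 SchemaRDD API; a brief sketch, assuming an existing `SQLContext` named `sqlCtx` and a SchemaRDD `people` (e.g. loaded from JSON as in the programming guide):

```scala
// Write an RDD with a schema out as a Parquet file (schema travels with the data)
people.saveAsParquetFile("people.parquet")

// Import it back; the schema is preserved, no DDL needed
val parquetPeople = sqlCtx.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")

// Run SQL queries over the imported data
sqlCtx.sql("SELECT name FROM parquet_people").collect().foreach(println)
```

The file path and table names are placeholders; the method names match the Spark 1.2 SQL programming guide linked above.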
58
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
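A sketch of the spark-avro library mentioned above, assuming its `avroFile` helper from the Spark 1.2-era releases and a hypothetical `episodes.avro` file:

```scala
// Requires the spark-avro package from https://github.com/databricks/spark-avro
// on the classpath, plus an existing SQLContext named sqlCtx
import com.databricks.spark.avro._

// Load Avro records as a SchemaRDD; the Avro schema drives the SQL schema
val episodes = sqlCtx.avroFile("episodes.avro")
episodes.registerTempTable("episodes")
sqlCtx.sql("SELECT title FROM episodes").collect().foreach(println)
```

Because the Avro schema is stored with the data, this copes with the "layout can change without notice" problem described in the use case: new fields appear in the inferred schema without code changes.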
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
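The RDD read/write integration described above is a one-liner each way with elasticsearch-hadoop's Spark support (the `blog/articles` index/type and document fields are placeholders):

```scala
// Requires the elasticsearch-hadoop (elasticsearch-spark) jar on the classpath
// and an existing SparkContext named sc, configured with es.nodes etc.
import org.elasticsearch.spark._

// Read: each hit becomes a (documentId, fieldMap) pair; the second
// argument is an optional Elasticsearch query
val articles = sc.esRDD("blog/articles", "?q=spark")

// Write: any RDD whose elements translate into documents can be saved
val docs = sc.makeRDD(Seq(Map("title" -> "Spark meets ES", "views" -> 10)))
docs.saveToEs("blog/articles")
```

A running Elasticsearch cluster is assumed; this is a sketch of the API shape, not a tuned pipeline.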
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4. Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented" and has a more mature shuffling implementation and closer YARN integration
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
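As a concrete illustration of option 4, reading directly from Amazon S3 looks just like reading from HDFS; a sketch, with the bucket name and key prefix as placeholders and credentials taken from the environment:

```scala
// Assumes an existing SparkContext named sc; the s3n:// connector ships
// with the Hadoop client libraries that Spark bundles
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// No HDFS anywhere: read logs straight out of an S3 bucket
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.count())
```

Credentials can alternatively be set in `core-site.xml` or the instance role; this snippet only shows the API shape.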
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters (with the data and analytical tools that your data scientists need) in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Component | Hadoop ecosystem | Spark ecosystem
Storage   | HDFS             | Tachyon
Resources | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software/
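The "without any code change" point follows from Tachyon exposing a Hadoop-compatible file system; a sketch (the master host and paths are placeholders, and the Tachyon client must be on the classpath):

```scala
// Spark reads and writes tachyon:// paths exactly like hdfs:// paths;
// 19998 is Tachyon's default master port
val events = sc.textFile("tachyon://master:19998/events/input")

events.filter(_.contains("ERROR"))
      .saveAsTextFile("tachyon://master:19998/events/errors")
```

Only the URI scheme changes relative to an HDFS-based job, which is exactly the compatibility claim made on this slide.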
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
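The conciseness claim above is easiest to see in the classic word count, shown here in the Scala API (input and output paths are placeholders; the Java 8 lambda version is nearly as short):

```scala
// Assumes an existing SparkContext named sc
val counts = sc.textFile("hdfs:///input/books")
  .flatMap(line => line.split(" ")) // one element per word
  .map(word => (word, 1))           // pair each word with a count of 1
  .reduceByKey(_ + _)               // sum counts per word across the cluster

counts.saveAsTextFile("hdfs:///output/wordcounts")
```

The equivalent MapReduce job typically requires separate mapper and reducer classes plus driver boilerplate, which is the contrast this slide is drawing.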
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
47
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
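A sketch of the Cassandra-as-storage-backend idea above, using the DataStax spark-cassandra-connector (the connection host, keyspace and table names are hypothetical):

```scala
// Requires the spark-cassandra-connector jar on the classpath
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows
val kv = sc.cassandraTable("test", "kv")
println(kv.count())

// Save an RDD of tuples back to Cassandra, mapping tuple fields to columns
sc.parallelize(Seq((3, "three"), (4, "four")))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))
```

A running Cassandra cluster with the `test.kv` table is assumed; the connector's own documentation covers write consistency and tuning options.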
48
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3. Integration
• There is also NSMC, a native Spark MongoDB connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: Introduction & Setup https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: Hive Example http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: Spark Example & Key Takeaways http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-starthtml
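Reading a collection through the Mongo-Hadoop connector mentioned above follows the standard Hadoop InputFormat pattern; a sketch, with the MongoDB URI, database and collection names as placeholders:

```scala
// Requires the mongo-hadoop core jar and the MongoDB Java driver on the classpath,
// plus an existing SparkContext named sc
import com.mongodb.hadoop.MongoInputFormat
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

val documents = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat], // splits the collection into Hadoop input splits
  classOf[Object],           // key: the document _id
  classOf[BSONObject])       // value: the BSON document itself

println(documents.count())
```

Writing works symmetrically through `MongoOutputFormat` with a `mongo.output.uri` setting; see the connector repository linked above for the supported options.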
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving; open Spark-on-YARN issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              for writing frameworks
Maturity          Less mature                      Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark on Non-Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
48
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
49
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0. Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
53
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com/
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem Spark ecosystem
65
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4 Complementarity - References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and memory
Running tasks     Unix processes                             Linux container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                       Storm                              Spark Streaming
Processing model               Record at a time                   Mini batches
Latency                        Sub-second                         Few seconds
Fault tolerance                At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration    Not available                      Core Spark API
Supported languages            Any programming language           Scala, Java, Python
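The processing-model row is the heart of this comparison. A plain-Python sketch of the difference (no Storm or Spark needed; the hypothetical `op` stands in for the per-record logic): a record-at-a-time engine applies the operation as each event arrives, while a micro-batch engine buffers a short window and processes it as one small batch.

```python
from itertools import islice

def record_at_a_time(stream, op):
    """Storm-style: op fires once per record, as each record arrives."""
    return [op(rec) for rec in stream]

def micro_batched(stream, op, batch_size):
    """Spark Streaming-style: the stream is cut into small batches;
    latency is therefore bounded below by the batch interval."""
    it = iter(stream)
    batches = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return batches
        batches.append([op(rec) for rec in batch])

events = [1, 2, 3, 4, 5, 6, 7]
per_record = record_at_a_time(events, lambda x: x * 10)
per_batch = micro_batched(events, lambda x: x * 10, batch_size=3)
```

Both produce the same results; the difference is the granularity at which results become available.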
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
49
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
  • GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
50
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in elasticsearch-hadoop was added in version 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web Applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them:
Hadoop ecosystem Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when the data we process is smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
50
3 Integration bull Neo4j is a highly scalable robust (fully ACID) native graph
databasebull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
51
3 Integration YARN bull YARN Yet Another Resource Negotiator Implicit reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
bull Some issues are critical ones bull Running Spark on YARN
httpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
52
3 Integrationbull Spark SQL provides built in support for Hive tables
bull Import relational data from Hive tablesbull Run SQL queries over imported data bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120bull Support of ORCFile (Optimized Row Columnar file) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries and for fetching dataset machine learning algorithms in MLlib
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration: HUE
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
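Whichever option you pick, the choice surfaces as the master URL handed to spark-submit. A sketch of the common forms in Spark 1.x (host names and ports below are placeholders, not real endpoints):

```shell
# Placeholders only: replace hosts/ports with your own cluster's.
spark-submit --master local[4]           app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077  app.py   # standalone cluster manager
spark-submit --master mesos://host:5050  app.py   # Apache Mesos
spark-submit --master yarn-cluster       app.py   # Hadoop YARN (Spark 1.x syntax)
```

The application code stays the same across all of these; only the master URL (and cluster-specific configuration) changes, which is what makes Spark deployment-agnostic.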
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
xPatterns
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
BlueData
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
Guavus
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
                 Hadoop ecosystem    Spark ecosystem
Components:
                 HDFS                Tachyon
                 YARN                Mesos
Tools:
                 Pig                 Spark native API
                 Hive                Spark SQL
                 Mahout              MLlib
                 Storm               Spark Streaming
                 Giraph              GraphX
                 HUE                 Spark Notebook / ISpark
88
Tachyon
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
Mesos
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria            YARN                           Mesos
Resource sharing    Yes                            Yes
Written in          Java                           C++
Scheduling          Memory only                    CPU and memory
Running tasks       Unix processes                 Linux container groups
Requests            Specific requests and          More generic, but more coding
                    locality preference            for writing frameworks
Maturity            Less mature                    Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
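The lambda-centric style the API encourages can be previewed with plain Python built-ins. This is a pure-Python stand-in for the classic RDD word-count chain, not PySpark itself; the real API spells these steps flatMap, map, and reduceByKey, and runs them over distributed partitions:

```python
from functools import reduce

# Word count in the functional style Spark's native API uses.
lines = ["spark or hadoop", "spark and hadoop", "hadoop"]

words = [w for line in lines for w in line.split()]            # flatMap
pairs = map(lambda w: (w, 1), words)                           # map
counts = reduce(lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
                pairs, {})                                     # reduceByKey
print(counts["spark"], counts["hadoop"])  # 2 3
```

The same chain reads almost identically in Scala, and with Java 8 lambdas it finally stops requiring pages of anonymous inner classes, which is the point the slide makes.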
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                    Storm                       Spark Streaming
Processing model            Record at a time            Mini batches
Latency                     Sub-second                  Few seconds
Fault tolerance (every      At least once (may be       Exactly once
record processed)           duplicates)
Batch framework             Not available               Core Spark API
integration
Supported languages         Any programming language    Scala, Java, Python
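The "mini batches" row is the key architectural difference: Spark Streaming groups incoming records into small batches and runs an ordinary batch job on each one. A toy simulation of that idea (real Spark Streaming batches by a time interval; this sketch batches by count to stay deterministic):

```python
def micro_batches(stream, batch_size):
    """Group an incoming record stream into mini batches, the way
    Spark Streaming discretizes a stream into small RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = range(1, 8)
batches = list(micro_batches(events, 3))
print(batches)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Each batch is then processed with the normal batch Spark API, which is why Spark Streaming gets core-Spark integration and exactly-once semantics essentially for free, at the cost of the few seconds of latency shown in the table.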
96
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
51
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open SPARK JIRA issues mentioning YARN (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
52
3. Integration: Hive
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration: Drill
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
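What makes Kafka a good match for Spark Streaming is its replayable, offset-addressed log: after a failure, a consumer can re-read from its last committed offset. A toy pure-Python sketch of that abstraction (not Kafka's actual API, just the core idea):

```python
class Log:
    """Toy append-only log with offsets, the core Kafka abstraction
    that lets a stream consumer replay records after a failure."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset assigned to the record

    def read_from(self, offset):
        return self.records[offset:]       # replay from a committed offset

log = Log()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

committed = 2  # suppose the consumer crashed after committing offset 2
# On restart, replay everything at and after the committed offset:
print(log.read_from(committed))  # ['c', 'd']
```

Because records stay in the log and are addressed by offset, a restarted Spark Streaming job can reprocess exactly the records it had not yet finished, which underpins the recovery story in the Netflix "Chaos Monkey" example cited earlier.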
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters and deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
             Hadoop Ecosystem    Spark Ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos
Criteria          YARN                              Mesos
Resource sharing  Yes                               Yes
Written in        Java                              C++
Scheduling        Memory only                       CPU and memory
Running tasks     Unix processes                    Linux container groups
Requests          Specific requests and             More generic, but more coding
                  locality preference               for writing frameworks
Maturity          Less mature                       Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data and ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming
Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini-batches
Latency                      Sub-second                  Few seconds
Fault tolerance (every       At least once (may be       Exactly once
record processed)            duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
52
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
53
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
54
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
56
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one over the other.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: Mesos + YARN references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and offers closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
53
3 Integration bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to address new use cases bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query in-memory data in Spark Embed Drill execution in a Spark data pipeline
Source Whats Coming in 2015 for Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
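The read/write round trip above can be sketched with the Spark 1.2-era Python API (`saveAsParquetFile` / `parquetFile`; the sample record and output path are placeholders):

```python
SAMPLE = '{"name": "Alice", "age": 34}'

def run(path="/tmp/people.parquet"):
    # Requires Spark 1.2+; the path is a placeholder -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="ParquetDemo")
    sqlContext = SQLContext(sc)
    people = sqlContext.jsonRDD(sc.parallelize([SAMPLE]))
    people.saveAsParquetFile(path)        # write the RDD out as Parquet
    back = sqlContext.parquetFile(path)   # import it back in
    back.registerTempTable("people")
    return sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect()
```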
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
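For reference, loading Avro through the spark-avro library looks roughly like this in the Spark 1.3-era Python API. Everything here is a placeholder sketch: the file name, table, and column are invented, and the `--packages` coordinates are only an example of how such a library is typically attached.

```python
def run(path="episodes.avro"):
    # Requires Spark 1.2+ with the spark-avro package available, e.g.
    #   spark-submit --packages com.databricks:spark-avro_2.10:1.0.0 ...
    # File name and column names below are placeholders -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="AvroDemo")
    sqlContext = SQLContext(sc)
    episodes = sqlContext.load(path, source="com.databricks.spark.avro")
    episodes.registerTempTable("episodes")
    return sqlContext.sql("SELECT title FROM episodes").collect()
```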
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop: 1. Evolution 2. Transition 3. Integration 4. Complementarity 5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem / Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when the data processed is smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
78
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a complex event processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: infrastructure, analytics, and applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
87
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos (criteria: YARN / Mesos)
• Resource sharing: yes / yes
• Written in: Java / C++
• Scheduling: memory only / CPU and memory
• Running tasks: Unix processes / Linux container groups
• Requests: specific requests and locality preference / more generic, but more coding for writing frameworks
• Maturity: less mature / relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
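The conciseness of the native API is easiest to see in the classic word count. A sketch in the Python API (the input path is a placeholder), with the same aggregation written as a plain function for comparison:

```python
def top_words(lines, n=3):
    # Plain-Python version of the flatMap/map/reduceByKey pipeline below.
    from collections import Counter
    return Counter(w for line in lines for w in line.split()).most_common(n)

def run():
    # Requires a Spark 1.x install; the input path is a placeholder -- sketch only.
    from pyspark import SparkContext

    sc = SparkContext(appName="NativeApiDemo")
    counts = (sc.textFile("hdfs:///logs/*")
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.takeOrdered(3, key=lambda kv: -kv[1]))
```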
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
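The "mix and match SQL and imperative APIs" point can be sketched with the Spark 1.x Python API: a Hive query feeds ordinary RDD operations. The table and column names are placeholders, and the per-department average is also written as a plain function so the logic is checkable without a cluster.

```python
def dept_averages(rows):
    # Plain-Python mirror of the RDD aggregation below; rows are (dept, salary).
    totals = {}
    for dept, salary in rows:
        s, n = totals.get(dept, (0.0, 0))
        totals[dept] = (s + salary, n + 1)
    return {d: s / n for d, (s, n) in totals.items()}

def run():
    # Requires Spark 1.2+ built with Hive support; table and column names are
    # placeholders -- sketch only.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="SqlMixDemo")
    hive = HiveContext(sc)  # reuses the existing Hive metastore, formats, UDFs
    rows = hive.sql("SELECT dept, salary FROM employees")   # declarative SQL...
    pairs = rows.map(lambda r: (r.dept, (r.salary, 1)))     # ...then imperative RDD ops
    sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return sums.mapValues(lambda s: s[0] / float(s[1])).collect()
```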
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming (criteria: Storm / Spark Streaming)
• Processing model: record at a time / mini-batches
• Latency: sub-second / few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) / exactly once
• Batch framework integration: not available / Core Spark API
• Supported languages: any programming language / Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop: 1. File System 2. Deployment 3. Distributions 4. Alternatives 5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
54
3 Integrationbull Apache Kafka is a high throughput distributed messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag24-kafka
55
3 Integrationbull Apache Flume is a streaming event data ingestion system that is designed for Big Data ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with Flume There are two approaches to thisbull Approach 1 Flume-style Push-based Approachbull Approach 2 (Experimental) Pull-based Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015. httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014. httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014. httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr. httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
Hadoop Ecosystem -> Spark Ecosystem
Components:
HDFS -> Tachyon
YARN -> Mesos
Tools:
Pig -> Spark native API
Hive -> Spark SQL
Mahout -> MLlib
Storm -> Spark Streaming
Giraph -> GraphX
HUE -> Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). httpsamplabcsberkeleyedusoftware
89
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos
Criteria: YARN | Mesos
• Resource sharing: yes | yes
• Written in: Java | C++
• Scheduling: memory only | CPU and memory
• Running tasks: Unix processes | Linux container groups
• Requests: specific requests and locality preference | more generic, but more coding to write frameworks
• Maturity: less mature | relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014. httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
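To show the functional shape of the native API, here is the classic word count sketched in plain Python using only the standard library. This is not PySpark: in real Spark code the three steps below would be the RDD transformations flatMap, map, and reduceByKey running across a cluster.

```python
from collections import Counter
from functools import reduce

# Toy input standing in for an RDD of text lines.
lines = ["spark or hadoop", "spark with hadoop"]

flat_mapped = [w for line in lines for w in line.split()]       # flatMap: line -> words
mapped = [(w, 1) for w in flat_mapped]                          # map: word -> (word, 1)
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),  # reduceByKey: sum counts
                mapped, Counter())

print(dict(counts))
```

The same pipeline in Scala or Python on Spark reads almost identically, which is the point of the slide: the API is a thin functional layer over distributed collections.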
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming
Criteria: Storm | Spark Streaming
• Processing model: record at a time | mini-batches
• Latency: sub-second | few seconds
• Fault tolerance (every record processed): at least once (may be duplicates) | exactly once
• Batch framework integration: not available | Core Spark API
• Supported languages: any programming language | Scala, Java, Python
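The processing-model row is the key difference. A toy plain-Python sketch of record-at-a-time versus mini-batch handling (not Storm or Spark code; real Spark Streaming batches by a time interval, not by record count):

```python
def record_at_a_time(stream, handle):
    # Storm-style: each record is handled the moment it arrives (low latency).
    for record in stream:
        handle([record])

def mini_batches(stream, handle, batch_size=3):
    # Spark Streaming-style: records are buffered and handled as small batches
    # (higher latency, but each batch can reuse the regular batch machinery).
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:           # flush the final partial batch
        handle(batch)

calls = []
mini_batches(range(7), calls.append, batch_size=3)
print(calls)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because each mini-batch is just a small batch job, Spark Streaming inherits the Core Spark API "for free", which is what the batch-framework-integration row refers to.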
96
GraphX
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark shell backend for IPython. httpsgithubcomtribbloidISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
55
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
56
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015. httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
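As a rough illustration of what schema inference means, a few lines of plain Python can derive a field-to-type mapping by scanning JSON records. This is only a sketch of the idea, not Spark SQL's implementation (which also merges conflicting types, handles nesting, and samples efficiently):

```python
import json

def infer_schema(json_lines):
    # Scan every record and note the Python type of the first value
    # seen for each field; missing fields simply come from other records.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

The payoff is the same as in Spark SQL: no DDL step, because the structure is discovered from the data itself.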
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of integrating Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
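The idea behind a columnar format can be sketched in plain Python: store each field's values contiguously, so a query that touches one column can skip the rest. The real Parquet format of course adds encodings, compression, and metadata on top of this transposition:

```python
# Row-oriented layout, as records arrive.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
]

# Columnar layout: one contiguous list per field. A query that only needs
# "clicks" never has to read the "user" values at all.
columns = {field: [row[field] for row in rows] for field in rows[0]}
print(columns)  # {'user': ['a', 'b'], 'clicks': [3, 5]}
```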
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1. httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark. httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. httpwwwgethuecom
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data web applications for interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014. httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015. httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
67
4. Complementarity: references
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster. httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4. Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
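A toy plain-Python sketch of why caching parsed data pays off when the working set fits in memory. Spark's cache() is of course distributed and lazy; this only shows the parse-once-reuse-many-times idea behind the "Data << RAM" case:

```python
parse_calls = 0  # instrumentation: count how often we pay the parsing cost

def parse(raw):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4"

# Without caching: every pass over the data re-parses the raw input.
without_cache = [sum(parse(raw_data)) for _ in range(3)]  # 3 parses

# With caching: parse once, then every subsequent pass reads memory.
cached = parse(raw_data)                                  # 1 parse
with_cache = [sum(cached) for _ in range(3)]

print(parse_calls)  # 4
```

When the parsed data is larger than memory, this caching advantage disappears, which is the "Data >> RAM" side of the comparison above.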
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014. httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group. httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015. httpblogsyncsortcom201503framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
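Conceptually, Spark selects a storage connector from the path's URI scheme, which is why swapping HDFS for another store is largely a path change in your job. A plain-Python sketch of that dispatch idea (the scheme names follow the examples above; this is not Spark's actual resolver):

```python
from urllib.parse import urlparse

# Example input paths a Spark job might receive; only the scheme differs.
paths = [
    "hdfs://namenode:8020/logs/2015/",   # Hadoop HDFS
    "s3n://my-bucket/logs/2015/",        # Amazon S3 (1.x-era s3n scheme)
    "tachyon://master:19998/logs/2015/", # Tachyon in-memory FS
    "file:///tmp/logs/",                 # local file system
]

# The scheme is all that decides which storage backend handles the path.
schemes = [urlparse(p).scheme for p in paths]
print(schemes)  # ['hdfs', 's3n', 'tachyon', 'file']
```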
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012. httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
56
3 Integrationbull Spark SQL provides built in support for JSON that is vastly simplifying the end-to-end-experience of working with JSON data
bull Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD No more DDL Just point Spark SQL to JSON files and query Starting Spark 13 SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
57
3 Integrationbull Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem regardless of the choice of data processing framework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows tobull Import relational data from Parquet files bull Run SQL queries over imported databull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration of Parquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
58
3 Integrationbull Spark SQL Avro Library for querying Avro data with Spark
SQL This library requires Spark 12+ httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015bull Problem
bull Various inbound data setsbull Data Layout can change without noticebull New data sets can be added without noticeResultbull Leverage Spark to dynamically split the databull Leverage Avro to store the data in a compact binary format
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
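In practice, the deployment choice surfaces mostly as the --master URL passed to spark-submit (or spark-shell). A hedged sketch, with host names, ports, and the application file as placeholders (the yarn-client syntax shown is the Spark 1.x form current when this deck was written):

```shell
# Same application, different cluster managers -- only --master changes.
spark-submit --master "local[4]"          my_app.py   # local mode, 4 worker threads
spark-submit --master spark://host:7077   my_app.py   # standalone Spark cluster
spark-submit --master mesos://host:5050   my_app.py   # Apache Mesos
spark-submit --master yarn-client         my_app.py   # Hadoop YARN (Spark 1.x syntax)
```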
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

              Hadoop Ecosystem   Spark Ecosystem
  Component:  HDFS               Tachyon
              YARN               Mesos
  Tools:      Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

  Criteria           YARN                            Mesos
  Resource sharing   Yes                             Yes
  Written in         Java                            C++
  Scheduling         Memory only                     CPU and Memory
  Running tasks      Unix processes                  Linux Container groups
  Requests           Specific requests and           More generic, but more coding
                     locality preference             for writing frameworks
  Maturity           Less mature                     Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

  Criteria                      Storm                       Spark Streaming
  Processing model              Record at a time            Mini batches
  Latency                       Sub-second                  Few seconds
  Fault tolerance (every        At least once               Exactly once
  record processed)             (may be duplicates)
  Batch framework integration   Not available               Core Spark API
  Supported languages           Any programming language    Scala, Java, Python
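The "mini batches" row is the key difference: where Storm hands each record to the topology as it arrives, Spark Streaming chops the stream into small batches and runs a regular Spark job on each one. A framework-free sketch of that idea (plain Python, no Spark, purely illustrative; Spark Streaming batches by time interval, so batch size stands in for time here):

```python
def micro_batches(records, batch_size):
    """Group an incoming stream into fixed-size mini batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch       # a "batch job" would run here, once per batch
            batch = []
    if batch:
        yield batch           # flush the final partial batch

# Record-at-a-time (Storm-style) would instead call process(record) five times.
stream = ["e1", "e2", "e3", "e4", "e5"]
print(list(micro_batches(stream, 2)))  # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```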
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
57
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
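The "translated into documents" requirement just means each record must map to a JSON document. As a connector-free illustration (plain Python, stdlib only; the index and type names are made up), this is the shape of the _bulk payload that tools ingesting into Elasticsearch ultimately produce, an action line followed by a document line per record:

```python
import json

def to_bulk_body(records, index="logs", doc_type="event"):
    """Render records as a newline-delimited Elasticsearch _bulk request body."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"   # the Bulk API requires a trailing newline

records = [{"msg": "spark job started"}, {"msg": "spark job finished"}]
body = to_bulk_body(records)
print(body)
```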
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                       Spark Streaming
Processing model             Record at a time            Mini batches
Latency                      Sub-second                  Few seconds
Fault tolerance              At least once               Exactly once
(every record processed)     (may be duplicates)
Batch framework integration  Not available               Core Spark API
Supported languages          Any programming language    Scala, Java, Python
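The "record at a time" vs "mini batches" distinction in the table can be illustrated in plain Python, with batch size standing in for Spark Streaming's batch time interval (a toy model, no Spark involved):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into small batches, the way Spark
    Streaming discretizes a stream into a sequence of RDDs. In this toy
    model, a count of records stands in for the batch time interval."""
    batch = []
    for record in stream:
        # Storm-style processing would handle `record` right here,
        # the moment it arrives (lowest latency, one record at a time).
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # Spark Streaming-style: emit a whole mini batch.
            batch = []
    if batch:
        yield batch  # flush the final partial batch

stream = iter(range(7))
batches = list(micro_batches(stream, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each emitted batch is then processed with the normal batch API, which is why Spark Streaming integrates with core Spark for free but pays a few seconds of latency per batch.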
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
58
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
59
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
60
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
61
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
62
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
64
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem | Spark ecosystem
65
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
66
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity: references
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
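In practice, "bring your own storage" works because Spark resolves the storage backend from the URI scheme of the path it is given. A minimal sketch of that idea (the scheme list is illustrative, and only the parsing is shown, not Spark's actual resolution code):

```python
from urllib.parse import urlparse

# Spark picks the storage backend from the path's URI scheme, which is
# what makes it file system agnostic. Hosts and paths here are made up.
paths = [
    "hdfs://namenode:8020/logs/2015/03/12",
    "s3n://my-bucket/logs/2015/03/12",
    "tachyon://master:19998/logs/2015/03/12",
    "file:///tmp/logs/2015/03/12",
]
schemes = [urlparse(p).scheme for p in paths]
print(schemes)  # ['hdfs', 's3n', 'tachyon', 'file']
```

The same application code (e.g. `sc.textFile(path)`) runs unchanged against any of these backends; only the path string differs.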
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
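From the application's point of view, these deployment modes differ mainly in the `--master` URL passed to spark-submit. Illustrative invocations for the Spark 1.x era (host names, ports, and `app.py` are placeholders; the YARN mode assumes `HADOOP_CONF_DIR` points at the cluster configuration):

```shell
# Local mode: one JVM using all cores, no cluster manager at all
spark-submit --master "local[*]" app.py

# Standalone cluster manager shipped with Spark
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN (Spark 1.x syntax)
spark-submit --master yarn-cluster app.py
```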
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution:
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

            Hadoop ecosystem   Spark ecosystem
Component:  HDFS               Tachyon
            YARN               Mesos
Tools:      Pig                Spark Native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
59
3 Integration Kite SDKbull The Kite SDK provides high level abstractions to work with datasets on Hadoop hiding many of the details of compression codecs file formats partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016 release so Spark jobs can read and write to Kite datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and Memory
Running tasks      Unix processes                              Linux Container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making code much more concise – nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
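The conciseness point is easiest to see on the classic word count. As a hedged sketch (no Spark installation assumed), here is the flatMap → map → reduceByKey chain that Spark's native API expresses in roughly one line per step, written with plain Python builtins; the RDD operator each step corresponds to is noted in the comments:

```python
from functools import reduce

lines = ["to be or not to be", "to be"]

# rdd.flatMap(lambda line: line.split())
words = [w for line in lines for w in line.split()]

# .map(lambda w: (w, 1))
pairs = [(w, 1) for w in words]

# .reduceByKey(lambda a, b: a + b)
counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
# counts now maps each word to its frequency
```

In Spark the same chain runs distributed over partitions of an RDD; the point here is only the functional, lambda-based style of the API.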
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
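The "mix and match" idea can be illustrated without a Spark cluster: run a declarative SQL step, then post-process the result imperatively in the same program. A minimal sketch using Python's built-in sqlite3 as a stand-in SQL engine (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 3), ("bob", 7), ("ann", 5)],
)

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user"
).fetchall()

# Imperative step: arbitrary post-processing on the result set
heavy_users = sorted(user for user, total in rows if total > 5)
```

In Spark SQL the two steps share one engine and one distributed dataset; here they only share a Python process, which is the part being illustrated.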
93
Spark MLlib

'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming

'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                    Storm                               Spark Streaming
Processing model                            Record at a time                    Mini batches
Latency                                     Sub-second                          Few seconds
Fault tolerance (every record processed)    At least once (may be duplicates)   Exactly once
Batch framework integration                 Not available                       Core Spark API
Supported languages                         Any programming language            Scala, Java, Python
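The first two rows of the table are two sides of the same design choice: Storm hands every record to user code as it arrives, while Spark Streaming first groups records into small batches, which adds seconds of latency but gives it batch semantics and the Core Spark API for free. A toy sketch of the batching half (batch size chosen arbitrarily for illustration):

```python
def mini_batches(stream, batch_size):
    """Group a record stream into small batches, as Spark Streaming's
    DStream model does (each batch is then processed like a small RDD)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # one micro-batch worth of records
            batch = []
    if batch:
        yield batch              # flush the final partial batch

batches = list(mini_batches(range(5), batch_size=2))
```

A record-at-a-time system would instead invoke the handler once per record, with no buffering step in between.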
96
GraphX

'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
60
3 Integration bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with Spark httpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of RDD that can read data from Elasticsearch Also any RDD can be saved to Elasticsearch as long as its content can be translated into documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch httpwwwintellilinkcojparticlecolumnbigdata-kk02html
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
61
3 Integration
bull Apache Solr added a Spark-based indexing tool for fast and easy indexing ingestion and serving searchable complex data ldquoCrunchIndexerTool on Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark Crunch and Morphlines bull Migrate ingestion of HDFS data into Solr from
MapReduce to Sparkbull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
87
4 Alternatives

           Hadoop Ecosystem   Spark Ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding to write frameworks
Maturity          Less mature                                 Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• "ETL with Spark" – First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python
96
GraphX
'GraphX' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
62
3 Integration bull HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
63
III Spark with Hadoop1 Evolution 2 Transition3 Integration4 Complementarity5 Key Takeaways
64
4 ComplementarityComponents of Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at rather than choosing one of them
Hadoop ecosystem Spark ecosystem
65
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
66
4 Complementarity + bull Mesos and YARN can work together each for what
it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesosbull lsquoMyriadrsquo Tag at SparkBigDatacom
httpsparkbigdatacomcomponenttagstag41
67
4 Complementarity + References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
bull YARN vs MESOS Canrsquot We All Just Get Along httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
68
4 Complementarity + bull Spark on Tez for efficient ETL https
githubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
bull Tez supports enterprise security
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1. File System
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
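In practice, the deployment modes above are selected by the master URL passed to spark-submit or SparkContext, not by different code. A small pure-Python classifier sketching how those documented master-URL formats map to cluster managers (the URL formats are Spark's; the helper function itself is ours, for illustration):

```python
# Spark 1.x master URLs, as documented:
#   local[4]            run locally with 4 worker threads
#   spark://host:7077   Spark standalone cluster
#   mesos://host:5050   Apache Mesos cluster
#   yarn-client         Hadoop YARN, client mode
def cluster_manager(master):
    """Classify a Spark master URL by the cluster manager it selects."""
    if master.startswith("local"):
        return "local mode"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("yarn"):
        return "YARN"
    return "unknown"

print(cluster_manager("local[4]"))            # local mode
print(cluster_manager("mesos://master:5050")) # Mesos
```

The point of the slide is exactly this: swapping `spark://…` for `mesos://…` (or dropping YARN entirely) is a one-line configuration change.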
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
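The conciseness claim is easiest to see in the classic word count. Since Spark is not assumed to be installed here, the RDD-style flatMap → map → reduceByKey chain is mimicked below with plain Python so it runs anywhere (the pipeline shape follows the Spark API; the list standing in for an RDD is our own toy data):

```python
from collections import Counter
from itertools import chain

# Word count in the shape of Spark's flatMap -> map -> reduceByKey
# pipeline, simulated over a plain Python list instead of an RDD.
lines = ["spark or hadoop", "spark with hadoop", "spark without hadoop"]

words = chain.from_iterable(line.split() for line in lines)   # flatMap
pairs = ((w, 1) for w in words)                               # map
counts = Counter()                                            # reduceByKey
for word, n in pairs:
    counts[word] += n

print(counts["spark"])   # 3
print(counts["hadoop"])  # 3
```

In actual PySpark the whole pipeline is the familiar three-call chain `sc.textFile(path).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`.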
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
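The "mix and match SQL with programmatic APIs" idea can be sketched without Spark by using the standard library's sqlite3: a declarative query does the aggregation, then ordinary code post-processes the rows. This only illustrates the programming model; Spark SQL's distributed engine and its DataFrame API are, of course, quite different.

```python
import sqlite3

# Declarative step: SQL over structured rows (a stand-in for Spark SQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [("ann", 3), ("bob", 7), ("ann", 2)])

rows = db.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary Python over the query result.
report = {user: total for user, total in rows}
print(report)   # {'ann': 5, 'bob': 7}
```

In Spark SQL the same interleaving is a `sqlContext.sql(...)` call whose result is an RDD/DataFrame you keep transforming in Scala, Java, or Python.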
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
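The first row of the table — record-at-a-time versus mini batches — is the core difference, and can be simulated in a few lines of plain Python. The timestamps and the one-second batch interval below are invented for the illustration; Spark Streaming's actual batching is done by the DStream runtime.

```python
# Storm-style: each record is handled the moment it arrives.
# Spark Streaming-style: records are grouped into fixed-interval
# mini batches and each batch is processed as a small Spark job.
records = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.3, "e")]

def to_mini_batches(stream, interval=1.0):
    """Bucket (timestamp, value) records into interval-sized batches."""
    batches = {}
    for ts, value in stream:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

print(to_mini_batches(records))   # [['a', 'b'], ['c', 'd'], ['e']]
```

Batching is why Spark Streaming's latency is "a few seconds" rather than sub-second, and also why it can reuse the core Spark batch API unchanged.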
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
66
4. Complementarity +
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications such as Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
67
4. Complementarity + References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
68
4. Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
69
4. Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
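The two rules of thumb above can be sketched as a tiny engine selector. This is an illustrative heuristic only: the function name, the 0.5 headroom factor, and the thresholds are invented for the example and do not come from either project.

```python
# Illustrative heuristic only: pick an execution engine from the
# "Data >> RAM" / "Data << RAM" rules of thumb above.
def pick_engine(data_gb, cluster_ram_gb, headroom=0.5):
    """Return 'spark' when the dataset fits in the usable share of
    cluster RAM (so in-memory caching pays off), 'tez' otherwise."""
    usable_ram_gb = cluster_ram_gb * headroom  # leave room for shuffle, OS, ...
    return "spark" if data_gb < usable_ram_gb else "tez"

print(pick_engine(100, 1000))   # data << RAM -> spark
print(pick_engine(5000, 1000))  # data >> RAM -> tez
```

In practice the decision also depends on workload shape (iterative vs. one-pass), not just data volume.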
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
72
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
74
1. File System
Spark does not require HDFS (the Hadoop Distributed File System); your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
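Because the storage backend is selected by the URI scheme passed to calls like `sc.textFile(...)`, "bring your own storage" mostly comes down to choosing a path prefix. Here is a minimal sketch of that routing in plain Python; the scheme table is illustrative and not exhaustive, and `storage_backend` is an invented helper, not a Spark API.

```python
from urllib.parse import urlparse

# Illustrative scheme table: storage systems Spark can address without HDFS.
SCHEMES = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon (in-memory)",
    "maprfs": "MapR-FS",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def storage_backend(path):
    """Classify a data path by URI scheme, the way Spark routes its I/O."""
    scheme = urlparse(path).scheme or "file"  # bare paths are local files
    return SCHEMES.get(scheme, "unknown")

print(storage_backend("s3n://bucket/logs/part-00000"))  # Amazon S3
print(storage_backend("/data/local.txt"))               # local file system
```

The application code stays the same whichever backend the prefix points at, which is the practical meaning of "file-system agnostic."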
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
76
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
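What changes between these deployments is essentially the `--master` URL handed to `spark-submit`; the cluster manager behind it is otherwise transparent to the application. A sketch, where host names and ports are placeholders and `yarn-client` reflects the Spark 1.x syntax current at the time of this talk:

```python
# Master URLs for common deployment modes (hosts/ports are placeholders).
MASTER_URLS = {
    "local":      "local[4]",           # 4 worker threads on one machine
    "standalone": "spark://host:7077",  # Spark standalone cluster
    "mesos":      "mesos://host:5050",  # Apache Mesos
    "yarn":       "yarn-client",        # Hadoop YARN, Spark 1.x style
}

def submit_command(mode, app="app.py"):
    """Build the spark-submit line for a given deployment mode."""
    return "spark-submit --master {} {}".format(MASTER_URLS[mode], app)

print(submit_command("mesos"))  # spark-submit --master mesos://host:5050 app.py
```

Switching cluster managers is a one-string change; the application code itself is untouched.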
78
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
79
3. Distributions
• Using Spark on a non-Hadoop distribution.
80
Cloud
• Databricks Cloud is not dependent on Hadoop: it gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
87
4. Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
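To show the API style these bullet points describe, here is the classic word count. The PySpark form appears in the comment; the plain-Python stand-in below mirrors it step for step so the example runs without a Spark installation.

```python
lines = ["spark or hadoop", "spark and hadoop"]

# PySpark equivalent (requires a SparkContext `sc`):
#   sc.parallelize(lines).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = {}
for word, one in pairs:                               # reduceByKey
    counts[word] = counts.get(word, 0) + one

print(counts["spark"], counts["hadoop"])  # 2 2
```

The lambda-heavy, pipeline-shaped style is what makes the Scala, Java 8, and Python APIs read almost identically.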
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
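The "mix and match SQL and imperative APIs" point is the key design idea. Since PySpark may not be installed where this is read, the sketch below uses Python's built-in sqlite3 purely as a stand-in for the pattern Spark SQL enables over RDDs/DataFrames: a declarative SQL step followed by ordinary code over its results. The table and values are invented.

```python
import sqlite3
import statistics

# Stand-in dataset (invented values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 120.0), ("a", 80.0), ("b", 300.0)])

# Declarative step: SQL handles filtering and projection...
rows = conn.execute(
    "SELECT latency_ms FROM events WHERE user = 'a'").fetchall()

# ...imperative step: plain code takes over for the follow-on analysis,
# the same mix-and-match that Spark SQL offers within one program.
latencies = [r[0] for r in rows]
print(statistics.mean(latencies))  # 100.0
```

In Spark SQL the handoff is tighter still: the SQL result is a distributed dataset you can keep transforming with the native API.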
93
Spark MLlib
• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
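The "mini-batches" row is the crux of the comparison: Spark Streaming discretizes a stream into short time windows and runs ordinary batch code on each one, which buys batch-API reuse at the cost of latency no lower than the batch interval. A toy simulation, where the timestamps and the 2-second interval are invented:

```python
# Records are (timestamp_seconds, value); the 2-second interval is invented.
stream = [(0.2, 1), (0.9, 2), (1.7, 3), (2.1, 4), (3.5, 5)]
BATCH_INTERVAL = 2.0

def to_micro_batches(records, interval):
    """Group records into time windows, the way Spark Streaming
    discretizes a stream into a sequence of small batches."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# The same batch function (here: sum) runs on every micro-batch,
# which is why Spark Streaming plugs straight into the core Spark API.
batch_sums = [sum(b) for b in to_micro_batches(stream, BATCH_INTERVAL)]
print(batch_sums)  # [6, 9]
```

Storm, by contrast, would hand each of the five records to the topology the moment it arrived, giving sub-second latency but no shared batch API.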
96
GraphX
• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop: 1. File System, 2. Deployment, 3. Distributions, 4. Alternatives, 5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
68
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
69
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
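The Data << RAM point above can be made concrete with a toy model (pure Python, no Spark involved): a data source that counts how many times the underlying storage is scanned, with and without caching the parsed data across analytics passes. `Source` and `run_passes` are made-up names for illustration only.

```python
# Toy model (not Spark) of why caching helps when data fits in memory:
# re-reading the source on every pass vs. caching it once.

class Source:
    """Wraps a record set and counts how many times storage is scanned."""
    def __init__(self, records):
        self.records = records
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self.records)

def run_passes(source, n_passes, cache=False):
    """Run n analytics passes; with cache=True the source is scanned once."""
    cached = source.read() if cache else None
    total = 0
    for _ in range(n_passes):
        data = cached if cache else source.read()
        total += sum(data)  # stand-in for the per-pass computation
    return total

src = Source([1, 2, 3])
assert run_passes(src, 3) == 18
assert src.scans == 3          # no cache: one scan per pass

src2 = Source([1, 2, 3])
assert run_passes(src2, 3, cache=True) == 18
assert src2.scans == 1         # cached: a single scan feeds every pass
```

When the working set exceeds memory, the cache cannot hold, which is exactly the Data >> RAM regime where a more stream-oriented engine can win.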
70
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5 Key Takeaways
1. Evolution: compute models are still evolving. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk against opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
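The storage agnosticism in the list above comes down to dispatching on the URI scheme of a path: the engine stays the same while the scheme picks the backend. A minimal sketch follows; this is not Spark's actual resolution code (which goes through the Hadoop FileSystem API), and the `BACKENDS` registry is a hypothetical stand-in.

```python
# Illustrative sketch of file-system agnosticism: the path's URI scheme,
# not the processing engine, decides which storage backend is used.
from urllib.parse import urlparse

# Hypothetical scheme-to-backend registry, invented for this example.
BACKENDS = {
    "hdfs":    "Hadoop Distributed File System",
    "s3n":     "Amazon S3",
    "maprfs":  "MapR-FS",
    "swift":   "OpenStack Swift",
    "tachyon": "Tachyon",
    "file":    "local file system",
}

def resolve_backend(path):
    """Pick a storage backend from the URI scheme; default to local files."""
    scheme = urlparse(path).scheme or "file"
    if scheme not in BACKENDS:
        raise ValueError("no backend registered for scheme: " + scheme)
    return BACKENDS[scheme]

assert resolve_backend("hdfs://namenode:8020/logs") == "Hadoop Distributed File System"
assert resolve_backend("s3n://bucket/events.json") == "Amazon S3"
assert resolve_backend("/tmp/local.txt") == "local file system"
```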
75
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System, Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
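In practice, the deployment choices above are selected in Spark through the master URL passed at submission time. The URL formats below follow Spark's documented conventions of that era (`local[*]`, `spark://`, `mesos://`, `yarn-client`/`yarn-cluster`); the parser itself is a simplified stand-in, not Spark's implementation.

```python
# Sketch of how a Spark-style master URL names the cluster manager.

def cluster_manager(master):
    """Map a master URL to the cluster manager that will run the job."""
    if master.startswith("local"):        # local, local[4], local[*]
        return "local threads (no cluster)"
    if master.startswith("spark://"):     # standalone master host:port
        return "Spark standalone"
    if master.startswith("mesos://"):     # Mesos master or ZooKeeper URL
        return "Apache Mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "Hadoop YARN"
    raise ValueError("unrecognized master URL: " + master)

assert cluster_manager("local[*]") == "local threads (no cluster)"
assert cluster_manager("spark://host:7077") == "Spark standalone"
assert cluster_manager("mesos://zk://host:2181/mesos") == "Apache Mesos"
assert cluster_manager("yarn-cluster") == "Hadoop YARN"
```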
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3 Distributions
• Using Spark on a non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform; data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com/
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4 Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org/
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as data center "OS": share the data center between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
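Mesos's fine-grained sharing rests on resource offers: idle CPU and memory on each node are offered to a framework, which accepts whatever its tasks need at that moment. A heavily simplified, hypothetical sketch of that accept step (real Mesos negotiates offers over an API, with many more attributes):

```python
# Toy model of Mesos-style resource offers: each offer is
# (node_id, idle_cpus, idle_mem_gb), and a framework greedily
# accepts offers big enough for one task.

def accept_offers(offers, task_cpus, task_mem):
    """Place one task per offer that fits; return the chosen node ids."""
    placed = []
    for node, cpus, mem in offers:
        if cpus >= task_cpus and mem >= task_mem:
            placed.append(node)  # accept this offer's idle resources
    return placed

# A task needing 2 CPUs / 4 GB can use the idle capacity of n1 and n3,
# while the nearly-full n2 is simply declined.
offers = [("n1", 4, 8), ("n2", 1, 2), ("n3", 8, 32)]
assert accept_offers(offers, 2, 4) == ["n1", "n3"]
```

This is why a long-running Spark job on Mesos can grow and shrink with the cluster's idle capacity instead of holding a fixed allocation.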
90
YARN vs Mesos

Criteria          YARN                                    Mesos
Resource sharing  Yes                                     Yes
Written in        Java                                    C++
Scheduling        Memory only                             CPU and memory
Running tasks     Unix processes                          Linux container groups
Requests          Specific requests, locality preference  More generic, but more coding to write frameworks
Maturity          Less mature                             Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
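To show the flavor of the native API, here is a miniature in-memory imitation of the RDD fluent style (map / flatMap / reduceByKey) running the classic word count. This is not Spark: `TinyRDD` is an invented class with no partitioning, laziness, or fault tolerance, only the chaining style.

```python
# A miniature, in-memory imitation of the RDD programming model.

class TinyRDD:
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # One input item can yield many output items (e.g. line -> words).
        return TinyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return TinyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge the values of each key with the given combine function.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return TinyRDD(acc.items())

    def collect(self):
        return self.data

lines = TinyRDD(["spark or hadoop", "spark and hadoop"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
assert dict(counts) == {"spark": 2, "or": 1, "hadoop": 2, "and": 1}
```

In real Spark the same chain runs lazily over partitioned data on a cluster; the shape of the code is what carries over between Scala, Java, and Python.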
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
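The "mix SQL and imperative APIs" idea can be illustrated without a cluster, using the standard library's sqlite3 in place of Spark SQL. The flow has the same shape: register structured rows, query them declaratively in SQL, then post-process the result in ordinary code. The `top_pages` helper and its log schema are invented for this example.

```python
# Mixing SQL with imperative post-processing, sketched with sqlite3
# (stand-in for Spark SQL so the example runs anywhere).
import sqlite3

def top_pages(rows, min_hits):
    """SQL aggregation over (page, hits) rows, then imperative formatting."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE logs (page TEXT, hits INTEGER)")
    con.executemany("INSERT INTO logs VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT page, SUM(hits) AS total FROM logs "
        "GROUP BY page HAVING SUM(hits) >= ? ORDER BY total DESC",
        (min_hits,))
    # Imperative step on the declarative result:
    return [f"{page}:{total}" for page, total in cur]

rows = [("/home", 3), ("/docs", 5), ("/home", 4)]
assert top_pages(rows, 5) == ["/home:7", "/docs:5"]
```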
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming

Criteria                     Storm                              Spark Streaming
Processing model             Record at a time                   Mini-batches
Latency                      Sub-second                         Few seconds
Fault tolerance              At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration  Not available                      Core Spark API
Supported languages          Any programming language           Scala, Java, Python
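The first row of the table, record-at-a-time versus mini-batches, can be sketched in a few lines of plain Python (timings and distribution abstracted away): per-record handling gives the lowest latency, while batching trades latency for per-invocation throughput. Both function names are invented for the illustration.

```python
# Toy contrast of the two stream-processing models.

def record_at_a_time(stream, handle):
    """Storm-style: invoke the handler once per record."""
    for rec in stream:
        handle([rec])

def mini_batches(stream, handle, batch_size):
    """Spark Streaming-style: group records and invoke per batch."""
    batch = []
    for rec in stream:
        batch.append(rec)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:                 # flush the final partial batch
        handle(batch)

calls = []
record_at_a_time([1, 2, 3, 4, 5], calls.append)
assert calls == [[1], [2], [3], [4], [5]]   # five invocations, minimal latency

calls = []
mini_batches([1, 2, 3, 4, 5], calls.append, batch_size=2)
assert calls == [[1, 2], [3, 4], [5]]       # three invocations, batched work
```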
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, and even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
69
4 Complementarity + bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better since it is more ldquostream orientedrdquo has more mature shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory parsed data it can be much better when we process data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native YARN Integration httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
70
4 Complementaritybull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process based on the type of platform the attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert Director of Product Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
70
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
71
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
72
5. Key Takeaways
1. Evolution: compute models are still evolving. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop and Spark ecosystems can work together, each doing what it is especially good at. One size doesn't fit all.
73
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
74
1. File System
Spark does not require HDFS, the Hadoop Distributed File System: your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
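One way to see this storage agnosticism in practice: the same Spark application can be pointed at any of these backends just by changing the input URI scheme. A sketch (app.py, the bucket, paths, and host names are placeholders; the scheme names are the ones commonly used with Spark 1.x):

```shell
# Same application, different storage backends: only the URI scheme changes
spark-submit app.py s3n://my-bucket/logs/         # Amazon S3
spark-submit app.py maprfs:///data/logs/          # MapR-FS
spark-submit app.py swift://logs.sahara/2015/     # OpenStack Swift
spark-submit app.py tachyon://master:19998/logs/  # Tachyon in-memory FS
```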
75
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
76
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
77
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
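In practice these deployment modes differ mainly in the --master URL handed to spark-submit. A sketch of the common forms (host names, ports, and app.py are placeholders; yarn-cluster was the Spark 1.x spelling):

```shell
# Local mode: run Spark in-process, one worker thread per core
spark-submit --master local[*] app.py

# Standalone cluster: point at the Spark master's host and port
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos: point at the Mesos master (no Hadoop involved)
spark-submit --master mesos://mesos-host:5050 app.py

# YARN: the only mode that assumes a Hadoop cluster
spark-submit --master yarn-cluster app.py
```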
78
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
79
3. Distributions
• Using Spark on a Non-Hadoop distribution
80
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
81
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open-source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives
              Hadoop Ecosystem    Spark Ecosystem
Components    HDFS                Tachyon
              YARN                Mesos
Tools         Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs Mesos
Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and memory
Running tasks      Unix processes                              Linux container groups
Requests           Specific requests and locality preference   More generic, but more coding to write frameworks
Maturity           Less mature                                 Relatively more mature
91
Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
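The concise lambda style the native API encourages can be sketched without a cluster. The following plain-Python pipeline (an illustration only, no SparkContext involved) mirrors the flatMap → map → reduceByKey shape of the classic RDD word count:

```python
from collections import defaultdict

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap + map: split each line into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey: sum the counts per word
def reduce_by_key(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_by_key(pairs)
print(word_counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Spark itself the same shape would be a one-liner over an RDD; the point here is just how naturally the computation decomposes into lambda-friendly map and reduce steps.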
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
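The "mix and match" idea (run a declarative SQL step, then continue imperatively on the result) can be illustrated without a Spark cluster. Here Python's built-in sqlite3 stands in for the SQL engine; this is a conceptual sketch, not Spark SQL code, and the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 6), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: continue the analysis in ordinary code
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 5), ('bob', 6)]
print(top)   # ('bob', 6)
```

Spark SQL applies the same pattern at cluster scale: the result of a SQL query is an ordinary distributed collection that subsequent functional or imperative code can keep transforming.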
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                      Storm                      Spark Streaming
Processing model              Record at a time           Mini-batches
Latency                       Sub-second                 A few seconds
Fault tolerance               At least once              Exactly once
(every record processed)      (may be duplicates)
Batch framework integration   Not available              Core Spark API
Supported languages           Any programming language   Scala, Java, Python
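The processing-model row is the key difference, and it can be sketched in a few lines of plain Python (an illustration, not either framework's API): record-at-a-time hands each record to a handler as it arrives, while mini-batching groups arriving records and processes each group at once.

```python
def record_at_a_time(stream, handle):
    # Storm-style: every record is handled the moment it arrives
    for record in stream:
        handle(record)

def mini_batches(stream, batch_size):
    # Spark Streaming-style: records are grouped into small batches
    # (grouped by count here; Spark Streaming groups by time interval)
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

stream = [1, 2, 3, 4, 5]
seen = []
record_at_a_time(stream, seen.append)
batches = list(mini_batches(stream, batch_size=2))
print(seen)     # [1, 2, 3, 4, 5]
print(batches)  # [[1, 2], [3, 4], [5]]
```

The batching is what buys Spark Streaming its high throughput and exactly-once semantics, at the cost of the few seconds of latency noted in the table.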
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
71
4 Complementaritybull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015 httpblogsyncsortcom201503framework-future-hadoop
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
72
5 Key Takeaways1 Evolution of compute models is still ongoing Watch
out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
73
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
74
1 File SystemSpark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtmlbull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationt
he-perfect-match-apache-spark-meets-swift
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4. Alternatives
Component         Hadoop Ecosystem   Spark Ecosystem
Storage           HDFS               Tachyon
Resource manager  YARN               Mesos
Tools             Pig                Spark native API
                  Hive               Spark SQL
                  Mahout             MLlib
                  Storm              Spark Streaming
                  Giraph             GraphX
                  HUE                Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
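The fine-grained sharing idea can be sketched as a toy model in plain Python (this is not the Mesos API; the framework names and task counts are invented for illustration): CPUs freed by one framework's finished tasks are immediately re-offered to any framework with pending work, instead of sitting idle inside a static partition.

```python
# Toy model of Mesos-style fine-grained sharing (illustration only;
# not the Mesos API). An allocator re-offers freed CPUs every tick
# to whichever framework still has pending 1-tick tasks.

def run_cluster(total_cpus, pending):
    """pending: dict framework -> number of 1-tick tasks.
    Returns ticks needed to drain all tasks when idle CPUs
    are re-offered round-robin every tick."""
    ticks = 0
    while any(pending.values()):
        cpus = total_cpus
        while cpus > 0 and any(pending.values()):
            for fw in pending:
                if cpus == 0:
                    break
                if pending[fw] > 0:
                    pending[fw] -= 1  # one task scheduled this tick
                    cpus -= 1
        ticks += 1
    return ticks

# Spark has a long backlog, another framework a short one; with
# sharing, Spark absorbs the CPUs the other framework stops using.
shared = run_cluster(4, {"spark": 10, "other": 2})

# Static partitioning: each framework is stuck with half the CPUs.
static = max(run_cluster(2, {"spark": 10}), run_cluster(2, {"other": 2}))

print(shared, static)  # prints: 3 5
```

The shared cluster drains in 3 ticks instead of 5, which is the effect the slide describes for long-running Spark jobs next to bursty neighbors.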
90
YARN vs Mesos
Criteria          YARN                                    Mesos
Resource sharing  Yes                                     Yes
Written in        Java                                    C++
Scheduling        Memory only                             CPU and memory
Running tasks     Unix processes                          Linux container groups
Requests          Specific requests, locality preference  More generic, but more coding to write frameworks
Maturity          Less mature                             Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
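To illustrate the conciseness point, here is the classic word count written with plain Python builtins in the same functional shape as the Spark RDD chain; no Spark installation is needed to run it, and the commented RDD version is the standard textbook form of the example.

```python
# The flavor of Spark's functional API, mimicked with Python builtins.
# The real RDD chain would be:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to spark or not to spark"]

words = chain.from_iterable(line.split() for line in lines)  # flatMap
counts = Counter(words)                                      # reduceByKey on (w, 1)

print(counts["to"])  # prints: 4
```

The lambda-based RDD version reads almost identically in Scala, Java 8, and Python, which is the point of the slide's bullet on Java 8 support.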
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides fast SQL performance while maintaining compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data and ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
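The "mix and match" idea can be sketched with Python's built-in sqlite3 module as a tiny stand-in engine (this is not Spark SQL, and the table and values are invented for illustration): a declarative SQL aggregation step followed by ordinary imperative post-processing, in one program.

```python
# Declarative SQL + imperative code in one program, using sqlite3
# as a stand-in engine (NOT Spark SQL; data is made up).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL.
rows = con.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result with ordinary code.
top = [user for user, total in rows if total > 4]
print(top)  # prints: ['ann', 'bob']
```

In Spark SQL the same pattern applies, except the SQL result is a distributed dataset that flows straight into the RDD/DataFrame APIs instead of a local list.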
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                     Storm                              Spark Streaming
Processing model             Record at a time                   Mini batches
Latency                      Sub-second                         Few seconds
Fault tolerance              At least once (may be duplicates)  Exactly once
(every record processed)
Batch framework integration  Not available                      Core Spark API
Supported languages          Any programming language           Scala, Java, Python
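The "record at a time" vs "mini batches" row above can be sketched in plain Python (a conceptual toy, not Storm or Spark Streaming code; the batch interval and arrival times are made-up illustration values): in a mini-batch engine, a record waits until its batch closes, which is where the few-seconds latency comes from.

```python
# Record-at-a-time vs mini-batch processing, as a conceptual toy
# (not Storm or Spark Streaming code; values are made up).

BATCH_INTERVAL = 2  # seconds per mini batch

arrivals = [0, 1, 2, 3, 4, 5]  # arrival time of each record

# Record-at-a-time: each record is processed as it arrives.
record_latency = [0 for _ in arrivals]

# Mini batches: a record waits until its batch boundary closes.
def batch_close(t):
    return ((t // BATCH_INTERVAL) + 1) * BATCH_INTERVAL

batch_latency = [batch_close(t) - t for t in arrivals]

print(max(record_latency), max(batch_latency))  # prints: 0 2
```

The worst-case added latency equals the batch interval, which is why Spark Streaming's latency is quoted in seconds while Storm's is sub-second.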
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic; bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic; choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
75
1 File SystemWhen coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS
storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfsbull hellip
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
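In practice, the choice of cluster manager surfaces as a single flag at submission time; the application itself is unchanged. A sketch of spark-submit invocations (host names, ports, and the app.py file are placeholders, not from the talk):

```shell
# Same application, different cluster managers (hosts/ports are placeholders).
spark-submit --master local[4]          app.py   # single machine, 4 cores
spark-submit --master spark://host:7077 app.py   # Spark standalone
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn              app.py   # Hadoop YARN
```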
90
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
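To illustrate the functional chaining style of the native API, here is a plain-Python analogue of the classic word count; this is not PySpark, it just mirrors the flatMap / map / reduceByKey shape:

```python
from collections import Counter
from functools import reduce

# Plain-Python analogue of Spark's word count, mirroring the RDD operators.
lines = ["to be or not to be", "to thrive"]
words = [w for line in lines for w in line.split()]            # ~ flatMap
pairs = [(w, 1) for w in words]                                # ~ map
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                pairs, Counter())                              # ~ reduceByKey

# The real PySpark pipeline would read roughly:
#   sc.textFile(path).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```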
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
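The "mix and match" pattern, declarative SQL for the relational step and imperative code for the rest, can be sketched with sqlite3 standing in for Spark SQL (the table name and data are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for Spark SQL here; the point is the pattern of
# combining a declarative SQL step with imperative post-processing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 7.5)])

# Declarative step: aggregate per user in SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user").fetchall()

# Imperative step: arbitrary program logic over the SQL result.
top = max(rows, key=lambda r: r[1])
```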
93
Spark MLlib
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
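The processing-model row is the key architectural difference between the two systems; a small pure-Python sketch of the two models (not Storm or Spark code):

```python
# Record-at-a-time: each event is handled as soon as it arrives.
def record_at_a_time(stream, handle):
    for record in stream:
        handle(record)  # per-record latency: typically sub-second

# Mini-batching: events are grouped into small batches before processing.
def mini_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # per-batch latency: typically a few seconds
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(mini_batches(range(5), batch_size=2))  # [[0, 1], [2, 3], [4]]
```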
96
GraphX
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
76
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
77
2 DeploymentWhile Spark is most often discussed as a replacement for MapReduce in Hadoop clusters to be deployed on YARN Spark is actually agnostic to the underlying infrastructure for clustering so alternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clustersbull Setting up Spark on top of SunOracle Grid Engine (PSI) -
httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sgebull Setting up Spark on the Brutus and Euler Clusters (ETH) -
httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
78
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives6 Key Takeaways
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
79
3 Distributionsbull Using Spark on a Non-Hadoop distribution
80
Cloud
bull Databricks Cloud is not dependent on Hadoop It gets its data from Amazonrsquos S3 (most commonly) Redshift Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and data products in an instant March 4 2015httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 httpswwwyoutubecomwatchv=dJQ5lV5Tldw
81
DSEbull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with Spark amp Cassandra Piotr Kolaczkowski September 26 2014httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra ConnectorHelena Edelson published on November 24 2014 httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
82
bull Stratio is a Big Data platform based on Spark It is 100 open source and enterprise ready httpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine httpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
83
bull xPatterns (httpatigeocomtechnology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers Infrastructure Analytics and Applications
bull xPatterns is cloud-based exceedingly scalable and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at SparkBigDatacomhttp
sparkbigdatacomcomponenttagstag39
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
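"Hadoop-compatible" above means applications address storage by URI, so a store like Tachyon can be swapped in by changing only the URI scheme. A minimal plain-Python sketch of that idea (the backends and their contents are hypothetical, not Tachyon's API):

```python
from urllib.parse import urlparse

# Hypothetical in-memory "filesystems" keyed by URI scheme, to illustrate
# scheme-based dispatch; real HDFS/Tachyon clients are far richer.
BACKENDS = {
    "hdfs": {"/data/input.txt": "stored on disk via HDFS"},
    "tachyon": {"/data/input.txt": "stored in memory via Tachyon"},
}

def read(uri: str) -> str:
    """Dispatch a read to whichever backend the URI scheme names."""
    parsed = urlparse(uri)
    return BACKENDS[parsed.scheme][parsed.path]

# The application logic is identical; only the scheme changes.
print(read("hdfs:///data/input.txt"))
print(read("tachyon:///data/input.txt"))
```

This is why "without any code change" holds: the job's logic never mentions the concrete storage system, only the URI it was given.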
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
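The fine-grained sharing described above rests on Mesos-style two-level scheduling: the master offers idle resources, and each framework decides what to accept. A toy simulation of that loop (node names and numbers are made up; this is the shape of the protocol, not Mesos code):

```python
# Toy sketch of Mesos-style resource offers: the master advertises idle
# CPUs, and a framework (e.g. a Spark job) accepts only what it needs.

def make_offers(cluster):
    """Offer whatever CPUs are currently idle on each node."""
    return {node: free for node, free in cluster.items() if free > 0}

def spark_framework(offers, cpus_wanted):
    """Accept as much of each offer as needed, then stop."""
    accepted = {}
    for node, free in offers.items():
        if cpus_wanted <= 0:
            break
        take = min(free, cpus_wanted)
        accepted[node] = take
        cpus_wanted -= take
    return accepted

cluster = {"node1": 2, "node2": 0, "node3": 4}  # idle CPUs per node
offers = make_offers(cluster)                   # node2 has nothing to offer
accepted = spark_framework(offers, cpus_wanted=5)
print(accepted)
```

Because idle capacity is re-offered continuously, a long-running Spark job can soak up resources other frameworks are not using at that moment, which is where the performance gains come from.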
90
YARN vs Mesos
Criteria          YARN                                       Mesos
Resource sharing  Yes                                        Yes
Written in        Java                                       C++
Scheduling        Memory only                                CPU and Memory
Running tasks     Unix processes                             Linux Container groups
Requests          Specific requests and locality preference  More generic, but more coding for writing frameworks
Maturity          Less mature                                Relatively more mature
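The "Scheduling" row is the most consequential difference in the table: a scheduler that checks memory alone (as the table describes YARN at the time) can place a task that a CPU-and-memory scheduler would reject. A tiny illustration with made-up numbers:

```python
# Toy placement checks; the node/task shapes are illustrative only.

def fits_memory_only(node, task):
    # Memory-only admission, as the table attributes to YARN.
    return task["mem"] <= node["mem"]

def fits_cpu_and_memory(node, task):
    # Multi-dimensional admission, as the table attributes to Mesos.
    return task["mem"] <= node["mem"] and task["cpu"] <= node["cpu"]

node = {"mem": 8, "cpu": 1}  # 8 GB and 1 CPU free
task = {"mem": 4, "cpu": 2}  # wants 4 GB and 2 CPUs

print(fits_memory_only(node, task))     # passes: memory is the only check
print(fits_cpu_and_memory(node, task))  # fails: the CPU dimension is short
```

The memory-only scheduler would oversubscribe the node's CPUs here, which is exactly the failure mode multi-resource scheduling avoids.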
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
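The conciseness claim above is easiest to see on the classic word count. This is plain Python (not Spark code, which would need a running SparkContext), with each step commented with the RDD operation it mirrors:

```python
from collections import Counter

# The shape of Spark's word count, step by step, in plain Python.
lines = ["spark or hadoop", "spark and hadoop"]

words = [w for line in lines for w in line.split()]  # flatMap(line => line.split(" "))
pairs = [(w, 1) for w in words]                      # map(w => (w, 1))
counts = Counter()                                   # reduceByKey(_ + _)
for w, n in pairs:
    counts[w] += n

print(dict(counts))
```

In Spark itself the same pipeline is three chained calls on an RDD, and the Scala, Java 8 lambda, and Python versions all read essentially like the comments above.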
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
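Ingesting self-describing sources like JSON works because a schema can be inferred from the records themselves. A toy sketch of that idea in plain Python (real Spark SQL does far more: nested types, type widening, sampling):

```python
import json

# Union the field names seen across JSON records and note each field's type,
# a simplified version of schema inference over semi-structured data.
records = ['{"name": "Ada", "age": 36}', '{"name": "Alan", "city": "London"}']

schema = {}
for raw in records:
    for field, value in json.loads(raw).items():
        schema.setdefault(field, type(value).__name__)

print(schema)
```

Once a schema exists, the same records can be queried with SQL or manipulated programmatically, which is the "mix and match" the slide describes.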
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs Spark Streaming
Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
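The "Processing model" row above can be sketched in a few lines of plain Python: record-at-a-time handles each event as it arrives, while a micro-batch model (Spark Streaming's approach) groups events into small time windows and processes each window as one batch. The events and window size are invented for illustration:

```python
# (timestamp_seconds, value) pairs standing in for a live event stream.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.3, "e")]

# Record-at-a-time (Storm-style): one processing call per event.
processed_one_by_one = [value.upper() for _, value in events]

# Micro-batches (Spark Streaming-style): bucket into 1-second windows,
# then process each window's records together.
batches = {}
for timestamp, value in events:
    batches.setdefault(int(timestamp), []).append(value)
processed_in_batches = {window: [v.upper() for v in vs] for window, vs in batches.items()}

print(processed_one_by_one)
print(processed_in_batches)
```

Batching is also why the latency rows differ: no record in a window can be emitted before its window closes, but each batch can reuse the core Spark execution machinery.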
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
84
bull The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
bull With EPIC software you can spin up Hadoop clusters ndash with the data and analytical tools that your data scientists need ndash in minutes rather than months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. See "Guavus Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos", September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
86
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
87
4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
88
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
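Because Tachyon implements the Hadoop FileSystem interface, "without any code change" concretely means only the input path's URI scheme changes while the program logic stays the same. A minimal sketch (hostnames and ports are made up):

```python
# Hypothetical cluster addresses; the analysis code is identical either way,
# only the URI scheme selects the storage backend.
hdfs_path    = "hdfs://namenode:9000/logs/2015/03/12"
tachyon_path = "tachyon://master:19998/logs/2015/03/12"

def scheme(path):
    # Spark resolves the backing file system from the scheme prefix
    return path.split("://", 1)[0]

print(scheme(hdfs_path), scheme(tachyon_path))  # hdfs tachyon
```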
89
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
90
YARN vs. Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
91
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making code much more concise – nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
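As a rough illustration of the functional style this API encourages, here is the classic word count written in plain Python, mirroring the flatMap / map / reduceByKey chain a PySpark program would apply to an RDD (no Spark installation is needed for this sketch):

```python
from functools import reduce

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: one record per word
words = [w for line in lines for w in line.split()]

# map: emit (word, 1) pairs, written as a lambda to echo the Spark style
pairs = list(map(lambda w: (w, 1), words))

# reduceByKey: fold the pairs into a per-word total
def reduce_by_key(acc, kv):
    key, value = kv
    acc[key] = acc.get(key, 0) + value
    return acc

counts = reduce(reduce_by_key, pairs, {})
print(counts)  # {'spark': 2, 'or': 1, 'hadoop': 2, 'and': 1}
```

In Java 8, the pairing step becomes a one-line lambda rather than an anonymous inner class, which is the conciseness the slide refers to.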
92
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
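The "mix and match" point – a declarative SQL step followed by imperative post-processing – can be sketched with Python's standard-library sqlite3, used here purely as a local stand-in for Spark SQL's distributed engine (table and data are made up):

```python
import sqlite3

# In-memory table standing in for a schema-bearing source (JSON, Parquet, Hive...)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ana", 3), ("bob", 7), ("ana", 5)])

# Declarative step: SQL aggregates clicks per user
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user"
).fetchall()

# Imperative step: arbitrary Python logic over the query result
heavy_users = sorted(user for user, total in rows if total > 5)
print(heavy_users)  # ['ana', 'bob']
```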
93
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
94
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
95
Storm vs. Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
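The "mini batches" row is the core difference: Spark Streaming buffers incoming records into fixed time windows and runs each window through the Spark engine, which is why its latency is a few seconds rather than sub-second. A toy sketch of that batching step (the stream and the 2-second interval are made up):

```python
# Hypothetical stream: (arrival_time_in_seconds, payload) records
stream = [(0.2, "a"), (0.9, "b"), (2.1, "c"), (3.5, "d"), (4.0, "e")]

BATCH_INTERVAL = 2.0  # seconds of records collected per mini batch

# Assign each record to its time window: [0,2) -> 0, [2,4) -> 1, ...
batches = {}
for t, payload in stream:
    window = int(t // BATCH_INTERVAL)
    batches.setdefault(window, []).append(payload)

# Spark Streaming would then run each window as one small Spark job
for window in sorted(batches):
    print(f"batch {window}: {batches[window]}")
# batch 0: ['a', 'b']
# batch 1: ['c', 'd']
# batch 2: ['e']
```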
96
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
97
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
98
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
99
5. Key Takeaways
1. File System: Spark is file system agnostic – bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic – choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing; Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
86
IV Spark without Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
87
4 AlternativesHadoop Ecosystem Spark EcosystemComponent
HDFS Tachyon YARN Mesos
ToolsPig Spark native APIHive Spark SQL
Mahout MLlibStorm Spark StreamingGiraph GraphXHUE Spark NotebookISpark
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
88
bull Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks such as Spark and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark and MapReduce programs can run on top of it without any code change
bull Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
89
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution This leads to considerable performance improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquobull Share datacenter between multiple cluster computing
apps Provide new abstractions and services bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag16-mesos
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
90
YARN vs MesosCriteria
Resource sharing
Yes Yes
Written in Java C++Scheduling Memory only CPU and MemoryRunning tasks Unix processes Linux Container groups
Requests Specific requests and locality preference
More generic but more coding for writing frameworks
Maturity Less mature Relatively more mature
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
IV More QampA
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
91
Spark Native APIbull Spark Native API in Scala Java and Pythonbull Interactive shell in Scala and Pythonbull Spark supports Java 8 for a much more concise Lambda expressions to get code nearly as simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
bull lsquoSpark Corersquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag11-core-spark
92
Spark SQLbull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains compatibility with Hive It supports all existing Hive data formats user-defined functions (UDF) and the Hive metastore
bull Spark SQL also allows manipulating (semi-) structured data as well as ingesting data from sources that provide schema such as JSON Parquet Hive or EDWs It unifies SQL and sophisticated analysis allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
93
Spark MLlib
lsquoSpark MLlib rsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
94
Spark Streaming
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-spark-streaming
95
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash every record processed
At least one ( may be duplicates)
Exactly one
Batch Framework integration
Not available Core Spark API
Supported languages
Any programming language
Scala Java Python
96
GraphX
lsquoGraphXrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag6-graphx
97
Notebook bull Zeppelin httpzeppelin-projectorg is a web-based notebook that enables interactive data analytics Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based editor that can combine Scala code SQL queries Markup or even JavaScript in a collaborative manner httpsgithubcomandypetrellaspark-notebook
bull ISpark is an Apache Spark-shell backend for IPython httpsgithubcomtribbloidISpark
98
IV Spark on Non-Hadoop1 File System2 Deployment 3 Distributions4 Alternatives5 Key Takeaways
99
6 Key Takeaways1 File System Spark is File System Agnostic Bring Your Own Storage2 Deployment Spark is Cluster Infrastructure Agnostic Choose your deployment 3 Distributions You are no longer tied to Hadoop for Big Data processing Spark distributions as service in the cloud or imbedded in Non-Hadoop distributions are emerging4 Alternatives Do your due diligence based on your own use case and research pros and cons before picking a specific tool or switching from one tool to another
100
V. More Q&A
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi