DESCRIPTION
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But there are serious advantages to many of the new tools, and this presentation will give an analysis of the current state, including pros and cons as well as what's needed to bootstrap and operate the various options.

About Robbie Strickland, Software Development Manager at The Weather Channel
Robbie works for The Weather Channel's digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years, including work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.
The New Analytics Toolbox
Going beyond Hadoop
whoami
Robbie Strickland
Software Dev Manager
linkedin.com/in/robbiestrickland
rostrickland@gmail.com
Hadoop works fine, right?
● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart
Landscape has changed ...
Lots of new tools available:
● Cloudera Impala: MPP queries for Hadoop data
● Apache Drill: MPP queries for Hadoop data
● Proprietary (Splunk, keen.io, etc.): varies by vendor
● Spark / Spark SQL: generic in-memory analysis
● Shark: Hive queries on Spark, replaced by Spark SQL
Spark wins?
● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration
Spark vs. Hadoop
MapReduce: [multi-slide code listing; screenshots not preserved]
Spark (angels sing!): [the same job in a few lines; screenshot not preserved]
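The slide screenshots making this comparison are lost, but the spirit can be sketched: a hypothetical word count in Spark's Scala API (paths and app name are illustrative, not from the talk) replaces the Mapper class, Reducer class, and driver configuration of classic MapReduce with a handful of lines:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    // The entire job: the equivalent Hadoop MapReduce version needs a
    // Mapper, a Reducer, and a driver full of job configuration.
    val counts = sc.textFile("hdfs:///input")   // illustrative path
      .flatMap(_.split("\\s+"))                 // "map" side: emit words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                       // "reduce" side: sum counts

    counts.saveAsTextFile("hdfs:///output")     // illustrative path
    sc.stop()
  }
}
```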
What is Spark?
● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input / output format
What is Spark?
● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via Spark SQL
● Works out of the box on EMR
● Easily join datasets from disparate sources
Spark on Cassandra
● Direct integration via DataStax driver (cassandra-driver-spark on GitHub)
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing
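A minimal sketch of what the integration above looks like in practice, using the early DataStax connector API; the host, keyspace, and table names here are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

object CassandraDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1") // illustrative host
    val sc = new SparkContext(conf)

    // Server-side filter: the where clause is pushed down to Cassandra,
    // so only matching rows cross the wire
    val adults = sc.cassandraTable("demo", "persons") // hypothetical keyspace/table
      .where("age > ?", 17)

    println(adults.count())
    sc.stop()
  }
}
```

Note there is no InputFormat or job configuration boilerplate: the connector handles data locality by preferring Spark workers co-located with the Cassandra replicas.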
Cassandra with Hadoop
[Diagram: three worker nodes, each running Cassandra + DataNode + TaskTracker, alongside a NameNode/Secondary NameNode and a JobTracker]
Cassandra with Spark (using HDFS)
[Diagram: three worker nodes, each running Cassandra + Spark Worker + DataNode, alongside a Spark Master and a NameNode/Secondary NameNode]
A Typical Spark Application
● SparkContext + SparkConf
● Data source to RDD[T]
● Transformations/Actions
● Saving/Displaying
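The four steps above can be sketched as one small application; the file path and app name are illustrative, not from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TypicalApp {
  def main(args: Array[String]): Unit = {
    // 1. SparkContext + SparkConf
    val sc = new SparkContext(new SparkConf().setAppName("typical-app"))

    // 2. Data source to RDD[T] (here, an RDD[String] of lines)
    val lines = sc.textFile("hdfs:///data/events") // illustrative path

    // 3. Transformations/Actions
    val errorCount = lines.filter(_.contains("ERROR")).count()

    // 4. Saving/Displaying
    println(s"errors: $errorCount")
    sc.stop()
  }
}
```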
Resilient Distributed Dataset (RDD)
● A distributed collection of items
● Transformations
  ○ Similar to those found in Scala collections
  ○ Lazily processed
● Can recalculate from any point of failure
RDD Transformations vs Actions
Transformations: Produce new RDDs
Actions: Require the materialization of the records to produce a value
RDD Transformations/Actions
Transformations: filter, map, flatMap, collect(λ):RDD[T], distinct, groupBy, subtract, union, zip, reduceByKey ...
Actions: collect:Array[T], count, fold, reduce ...
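The lazy/eager split can be made concrete with a small sketch (running locally; the numbers are illustrative). Declaring transformations builds a lineage but computes nothing; only the action at the end triggers execution:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 10)

    val evens   = numbers.filter(_ % 2 == 0) // transformation: lazy, new RDD
    val doubled = evens.map(_ * 2)           // still lazy: nothing has run yet
    val total   = doubled.reduce(_ + _)      // action: materializes the pipeline

    println(total) // 60
    sc.stop()
  }
}
```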
Resilient Distributed Dataset (RDD)
val numOfAdults = persons.filter(_.age > 17).count()
(filter is the transformation; count is the action)
Example
Demo
Spark SQL Example
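The example shown on this slide is not preserved. As a rough sketch of what Spark SQL usage looked like in the Spark 1.x API of the era (the case class, data, and query are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-demo"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.x implicit: RDD[Person] -> SchemaRDD

    val persons = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 12)))
    persons.registerTempTable("persons") // expose the RDD to SQL queries

    // The same filter as the earlier RDD example, expressed as SQL
    val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 17")
    adults.collect().foreach(println)
    sc.stop()
  }
}
```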
Spark Streaming
● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
  ○ Queues
  ○ Twitter
  ○ Flume
  ○ etc.
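A minimal streaming sketch of the interval-based model described above; the socket source, host, and port are illustrative stand-ins for any of the supported sources:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo")
    // Batch interval: a new RDD is produced from the stream every 10 seconds
    val ssc = new StreamingContext(conf, Seconds(10))

    // Illustrative source; Twitter, Flume, queues, etc. plug in similarly
    val lines = ssc.socketTextStream("localhost", 9999)

    // Same ops as "normal" RDDs, applied to each interval's batch
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```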
The Output
Real World Task Distribution

Transformations and Actions: similar to Scala, but your choice of transformations must be made with task distribution and memory in mind.

How do we do that?
● Partition the data appropriately for the number of cores (at least 2x the number of cores)
● Filter early and often (queries, filters, distinct ...)
● Use pipelines
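The first two points can be sketched as follows; the core count, path, and record format are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-demo"))

    // Suppose the cluster exposes 8 cores in total (illustrative number):
    // ask for ~2x that many partitions so every core stays busy
    val raw = sc.textFile("hdfs:///data/events", 16)

    // Filter early: shrink the dataset before any expensive wide operation
    val errors = raw.filter(_.contains("ERROR"))

    // Pipeline narrow transformations: Spark fuses filter + map into a
    // single pass per partition, with no intermediate materialization
    val codes = errors.map(_.split("\\t")(0)).distinct()

    println(codes.count())
    sc.stop()
  }
}
```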
Real World Task Distribution

How do we do that?
● Use partial aggregation
● Make the algorithm as simple and efficient as it can be while still appropriately answering the question

Some common costly transformations:
● sorting
● groupByKey
● reduceByKey
● ...
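All of these involve a shuffle, but they are not equally costly. A small sketch of the partial-aggregation point (the data is invented): reduceByKey combines values on each node before shuffling, while groupByKey ships every value across the network first:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-demo"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Costly: every individual value crosses the network, then gets summed
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // Cheaper: partial aggregation sums values locally per partition first,
    // so only one partial sum per key per node is shuffled
    val viaReduce = pairs.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)
    sc.stop()
  }
}
```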
Thank you!
Robbie Strickland
rostrickland@gmail.com
https://github.com/rstrickland/sparkdemo