Boosting Big Data with Apache Spark

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Mathias Lavaert, April 2015

About Infofarm


Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.



Java

PHP

E-Commerce

Mobile

Web

Development



About me

Mathias Lavaert

Big Data Developer at InfoFarm since May 2014

Proud citizen of West-Flanders

Outdoor enthusiast


Agenda

• What is Apache Spark?

• An in-depth overview

– Spark Core and Resilient Distributed Datasets

– Unified access to structured data with Spark SQL

– Machine Learning with Spark MLlib

– Scalable streaming applications with Spark Streaming

• Q&A

• Wrap-up & lunch


What is Apache Spark?


“Apache Spark is a fast and general engine for big data

processing, with built-in modules for streaming, SQL,

machine learning and graph processing”


History

• Created by Matei Zaharia at UC Berkeley in 2009

• Based on the 2007 Microsoft Dryad paper

• Donated to the Apache Software Foundation in 2013

• 465 contributors in 2014, making it the most active Apache project

• Currently supported by Databricks, a company founded

by the creators of Apache Spark


Target users

● Data Scientists

○ Data exploration and data modelling using interactive shells

○ Machine Learning

○ Ad hoc analysis to answer business questions or discover new insights

● Engineers

○ Fault-tolerant production data applications

○ ‘Productizing’ the work of the data scientist

○ Integration with business applications


Where to situate Apache Spark?


Differences with MapReduce

• Faster, by minimizing I/O and keeping data in memory as much as possible

• Unified libraries

• Huge community effort, very fast

development pace.

• Ships with higher level tools included


Daytona GraySort Contest


Differences with Hive, Pig, others...

• One integrated framework that suits a

wide range of problems

• No need for a workflow application like

Oozie

• Only 1 language/framework to learn


Explosion of Specialized Systems


Architecture


Advantages of unified libraries

Advancements in higher-level libraries are pushed down into core and

vice-versa

● Spark Core

○ Highly-optimized, low overhead, network-saturating shuffle

● Spark Streaming

○ Garbage collection, memory management, cleanup

improvements

● Spark GraphX

○ IndexedRDD for random access within a partition vs scanning

entire partition

● Spark MLlib

○ Statistics (Correlations, sampling, heuristics)


Supported languages


Difference between Java and Scala


Cluster Resource Managers

● Spark Standalone

○ Suitable for a lot of production workloads

○ Only suitable for Spark workloads

● YARN

○ Allows hierarchies of resources

○ Kerberos integration

○ Multiple workloads from different execution frameworks

■ Hive, Pig, Spark, MapReduce, Cascading, etc…

● Mesos

○ Similar to YARN, but allows elastic allocation

○ Coarse-grained

■ A single, long-running Mesos task runs Spark mini tasks

○ Fine-grained

■ New Mesos task for each Spark task

■ Higher overhead, not good for long-running Spark jobs (Streaming)


Storage Layers for Spark

Spark can create distributed datasets from:

● Any file stored in the Hadoop distributed filesystem (HDFS)

● Any storage system supported by the Hadoop APIs

○ Local filesystem

○ S3

○ Cassandra

○ Hive

○ HBase

Note that Apache Spark doesn’t require Hadoop, but it has support for

storage systems implementing the Hadoop APIs.


Short introduction to functional programming


What is functional programming?

A programming paradigm where the

basic unit of abstraction is the function


Basic concepts

● Higher-order functions

○ Functions that take other functions as arguments

○ or return a function as their result

● Pure functions

○ Purely functional expressions have no side effects

● Recursion

○ Iteration in functional languages is usually

accomplished via recursion.

● Immutable data structures


Small example with a functional language: Scala
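The example shown on this slide did not survive the export, so here is a minimal stand-in sketch of the concepts listed above. The names `FunctionalBasics`, `applyTwice`, `adder` and `sum` are made up for illustration only.

```scala
object FunctionalBasics {
  // Higher-order function: takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Higher-order function: returns a new function as its result.
  def adder(n: Int): Int => Int = (x: Int) => x + n

  // Pure and recursive: the result depends only on the argument, and
  // iteration is expressed as recursion instead of a mutable loop.
  def sum(xs: List[Int]): Int = xs match {
    case Nil          => 0
    case head :: tail => head + sum(tail)
  }

  def main(args: Array[String]): Unit = {
    val numbers = List(1, 2, 3, 4)     // immutable list
    val doubled = numbers.map(_ * 2)   // builds a new list, `numbers` is unchanged
    println(doubled)                   // List(2, 4, 6, 8)
    println(applyTwice(adder(3), 1))   // 7
    println(sum(numbers))              // 10
  }
}
```

The same style (passing small functions to `map`, `filter` and friends on immutable collections) is what the Spark API builds on.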


Introduction to Spark concepts


Resilient Distributed Datasets (RDDs)

● Core Spark abstraction

● Immutable distributed collection of objects

● Split into multiple partitions

● May be computed on different nodes of the cluster

● Can contain any type of Scala, Java or Python object

including user-defined classes

“Distributed Scala collections”


Driver and context

● Driver

○ Shell

○ Standalone program

● Spark Context represents a connection to a computing cluster
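As a rough illustration of these ideas, the sketch below plays the role of a standalone driver program: it opens a Spark Context against an in-process "cluster" (`local[2]`), turns a local range into an RDD, and reports how many elements each partition holds. The object and method names are hypothetical, and a real deployment would point the master URL at an actual cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPartitions {
  // Returns the number of elements held by each partition of a small RDD.
  def partitionSizes(numElements: Int, numPartitions: Int): Seq[Int] = {
    // The driver creates the SparkContext, i.e. the connection to the cluster.
    // "local[2]" runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-partitions")
    val sc = new SparkContext(conf)
    try {
      // Distribute a local collection over the requested number of partitions.
      val rdd = sc.parallelize(1 to numElements, numPartitions)
      // glom() turns every partition into one array so its size can be inspected.
      rdd.glom().map(_.length).collect().toSeq
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(partitionSizes(100, 4))
}
```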


RDD Operations

● Transformations

○ map

○ filter

○ flatMap

○ sample

○ groupByKey

○ reduceByKey

○ union

○ join

○ sort

● Actions

○ count

○ collect

○ reduce

○ lookup

○ save

● Transformations are lazy

● Actions force the computation of transformations
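A minimal word-count sketch of the split between lazy transformations and actions, assuming a local master and Spark 1.3 or later (earlier releases needed an extra `import org.apache.spark.SparkContext._` for `reduceByKey`). `LazyWordCount` and `countWords` are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyWordCount {
  def countWords(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-word-count")
    val sc = new SparkContext(conf)
    try {
      // Transformations: each call only records a step in the lineage,
      // no data is processed yet.
      val counts = sc.parallelize(lines)
        .flatMap(_.split(" "))
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      // Action: collect() triggers the whole chain and ships the result
      // back to the driver.
      counts.collect().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(countWords(Seq("spark is fast", "spark is general")))
}
```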


Narrow vs wide dependencies
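The diagram for this slide is missing from the export. In short: with a narrow dependency each child partition depends on a bounded set of parent partitions (for example `map`), while a wide dependency needs data from many parent partitions and therefore a shuffle (for example `reduceByKey` on data that is not yet partitioned by key). The sketch below inspects this through the `dependencies` field of an RDD; the object name is hypothetical, and `NarrowDependency` and `ShuffleDependency` are developer-level Spark classes.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyKinds {
  // Returns (map is narrow, reduceByKey is wide) for a tiny example job.
  def inspect(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dependency-kinds")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("a", "b", "a"), 2).map(word => (word, 1))
      val summed = pairs.reduceByKey(_ + _)
      // map: every output partition reads exactly one input partition.
      val mapIsNarrow = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
      // reduceByKey: values for one key may live in any input partition.
      val reduceIsWide = summed.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
      (mapIsNarrow, reduceIsWide)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(inspect())
}
```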


Demo using only core operations


Specialized operations for specific types of RDDs


Specialized operations for Key/Value pairs

● reduceByKey

● groupByKey

● combineByKey

● mapValues

● flatMapValues

● keys

● sortByKey

● subtractByKey

● join

● rightOuterJoin

● leftOuterJoin

● cogroup
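A small sketch combining three of these operations (`reduceByKey`, `mapValues`, `join`) on made-up order and customer data. All names and values are invented for illustration; Spark 1.3 or later is assumed so that the pair operations are available without extra imports.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddOps {
  // Returns customerId -> (upper-cased name, total order amount).
  def totalsWithNames(): Map[Int, (String, Double)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("pair-rdd-ops")
    val sc = new SparkContext(conf)
    try {
      val orders = sc.parallelize(Seq((1, 10.0), (2, 5.0), (1, 2.5)))
      val names = sc.parallelize(Seq((1, "alice"), (2, "bob")))

      val totals = orders.reduceByKey(_ + _)        // sum the amounts per key
      val upper = names.mapValues(_.toUpperCase)    // change values, keep keys
      upper.join(totals).collect().toMap            // inner join on the key
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(totalsWithNames())
}
```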


Specialized operations for numeric RDDs

● count

● mean

● sum

● max

● min

● variance

● sampleVariance

● stdev

● sampleStdev
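As a sketch, `stats()` on an RDD of doubles computes most of these values in a single pass and returns them in a `StatCounter`. The data and the object name below are made up; Spark 1.3 or later is assumed for the implicit numeric operations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.StatCounter

object NumericRddOps {
  def summarize(values: Seq[Double]): StatCounter = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("numeric-rdd-ops")
    val sc = new SparkContext(conf)
    try {
      // stats() gathers count, mean, sum, min, max and the variances at once.
      sc.parallelize(values).stats()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val s = summarize(Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
    println(s"count=${s.count} mean=${s.mean} stdev=${s.stdev} sampleStdev=${s.sampleStdev}")
  }
}
```

The individual methods (`mean()`, `variance()`, and so on) can also be called directly on the RDD, at the cost of one job per call.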


And many more...

● HadoopRDD

● FilteredRDD

● MappedRDD

● PairRDD

● ShuffledRDD

● UnionRDD

● DoubleRDD

● JdbcRDD

● JsonRDD

● SchemaRDD

● VertexRDD

● EdgeRDD

● CassandraRDD

● GeoRDD

● EsSpark (Elasticsearch)


Spark SQL


Spark SQL Overview

● Newest component of Spark

● Tightly integrated to work with structured data

○ Tables with rows and columns

● Transform RDDs using SQL

● Data source integration: Hive, Parquet, JSON and more…

● Optimizes execution plan


Differences with Spark Core

● Spark + RDDs

○ Functional transformations on collections of objects

● SQL + SchemaRDDs

○ Declarative transformations on collections of tuples


Getting started with Spark SQL

● Create an instance of SQLContext or HiveContext

○ Entry point for all SQL functionality

○ Wraps/extends existing Spark Context (Decorator Pattern)

● If you’re using the shell, a SQLContext has already been created for you

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sparkContext = new SparkContext("local[4]", "SQL")
val sqlContext = new SQLContext(sparkContext)
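Once a SQLContext exists, an RDD of case-class objects can be exposed as a table and queried declaratively. The sketch below assumes the Spark 1.3-era API (`createDataFrame`, `registerTempTable`); newer releases prefer `SparkSession` and `createOrReplaceTempView`. The `Person` class, the table name and the query are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object SqlSketch {
  def adultNames(people: Seq[Person]): Seq[String] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("sql-sketch")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // The schema (name: String, age: Int) is inferred from the case class.
      val table = sqlContext.createDataFrame(sc.parallelize(people))
      table.registerTempTable("people")
      // Declarative query instead of hand-written filter/map/sort calls.
      sqlContext
        .sql("SELECT name FROM people WHERE age >= 18 ORDER BY name")
        .collect()
        .map(_.getString(0))
        .toSeq
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(adultNames(Seq(Person("Tom", 12), Person("Ann", 34), Person("Bob", 51))))
}
```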


Language Integrated UDFs

● Ability to write custom SQL functions in any of the languages supported by Spark

● Another example of how Spark simplifies the big data stack
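A sketch of a language-integrated UDF: an ordinary Scala function is registered under a name and then called from SQL. It assumes the `sqlContext.udf.register` API available from Spark 1.3; the `City` class, the table name and the function name `strLen` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
case class City(name: String)

object UdfSketch {
  def nameLengths(cities: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("udf-sketch")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      sqlContext.createDataFrame(sc.parallelize(cities.map(c => City(c))))
        .registerTempTable("cities")
      // Register a plain Scala function so SQL queries can call it by name.
      sqlContext.udf.register("strLen", (s: String) => s.length)
      sqlContext.sql("SELECT name, strLen(name) FROM cities")
        .collect()
        .map(row => row.getString(0) -> row.getInt(1))
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(nameLengths(Seq("Kontich", "Gent")))
}
```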


Parquet compatibility

Native support for reading data stored in Parquet:

● Columnar storage avoids reading unneeded data

● SchemaRDDs can be written to Parquet while preserving the schema

● Convert other slower formats like JSON to Parquet for repeated querying.
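A round-trip sketch showing that the schema survives a write to Parquet. Note the assumption: it uses the `write`/`read` API introduced in Spark 1.4, whereas earlier releases exposed the same functionality under different method names. The `Visit` class and the temporary output path are made up for this example.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
case class Visit(page: String, hits: Int)

object ParquetSketch {
  // Writes the visits to Parquet, reads them back, and returns
  // the restored column names together with the row count.
  def roundTrip(visits: Seq[Visit]): (Seq[String], Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("parquet-sketch")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // The target must not exist yet, so use a fresh name inside a temp directory.
      val dir = Files.createTempDirectory("parquet-sketch").toFile
      val path = new File(dir, "visits.parquet").getPath

      sqlContext.createDataFrame(sc.parallelize(visits)).write.parquet(path)

      // No schema is supplied here: it is read from the Parquet metadata.
      val restored = sqlContext.read.parquet(path)
      (restored.schema.fieldNames.toSeq, restored.count())
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(roundTrip(Seq(Visit("/home", 3), Visit("/about", 1))))
}
```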


Demo: Spark SQL


Spark MLlib


Machine Learning Algorithms

● Supervised

○ Prediction: train a model with existing data + labels, predict labels for new data

■ Classification (categorical)

■ Regression (continuous numeric)

○ Recommendation: recommend to similar users

■ User -> user, item -> item, user -> item similarity

● Unsupervised

○ Clustering: Find natural clusters in data based on similarities


Algorithms provided by Spark

● Classification and regression

○ Linear models (SVMs, logistic regression, linear regression)

○ Naive Bayes

○ Decision trees

○ Ensembles of trees (Random Forests and Gradient-Boosted trees)

○ Isotonic regression

● Recommendations

○ Alternating Least Squares (ALS)

○ FP-growth

● Clustering

○ K-Means

○ Gaussian mixture

○ Power Iteration clustering

○ Latent Dirichlet allocation

○ Streaming k-means

● Dimensionality reduction

○ Singular value decomposition (SVD)

○ Principal component analysis (PCA)
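To give a feel for the API, here is a minimal K-Means sketch using the RDD-based `org.apache.spark.mllib` package. The input points, the object name and the choice of 20 iterations are assumptions for illustration; cluster ids are arbitrary, so only the grouping is meaningful.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Trains a K-Means model and returns the cluster id assigned to every input point.
  def clusterOf(points: Seq[Array[Double]], k: Int): Seq[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kmeans-sketch")
    val sc = new SparkContext(conf)
    try {
      // Features are stored in MLlib's uniform Vector format.
      val data = sc.parallelize(points.map(p => Vectors.dense(p))).cache()
      val model = KMeans.train(data, k, 20) // at most 20 iterations
      points.map(p => model.predict(Vectors.dense(p)))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(
      Array(0.0, 0.0), Array(0.1, 0.1), Array(0.2, 0.0),
      Array(9.0, 9.0), Array(9.1, 9.1), Array(9.2, 9.0))
    println(clusterOf(points, 2))
  }
}
```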


Tools provided by Spark

● Tools for basic statistics including

○ Summary statistics

○ Correlations

○ Sampling

○ Hypothesis testing

○ Random data generation

● Tools for feature extraction and transformation

○ Extracting features out of text

○ Uniform Vector format to store features

● Tools to build Machine Learning Pipelines

using Spark SQL
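A short sketch of the basic statistics tools, using `Statistics.colStats` for summary statistics and `Statistics.corr` for a Pearson correlation. The two series are made up (the second is simply twice the first), and the object name is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object StatsSketch {
  // Returns the column means of (x, y) and the Pearson correlation between x and y.
  def summarize(): (Array[Double], Double) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("stats-sketch")
    val sc = new SparkContext(conf)
    try {
      val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
      val y = x.map(_ * 2.0) // perfectly correlated with x by construction
      // One dense vector per observation: (x, y).
      val rows = x.zip(y).map { case (a, b) => Vectors.dense(a, b) }

      val summary = Statistics.colStats(rows)
      val correlation = Statistics.corr(x, y, "pearson")
      (summary.mean.toArray, correlation)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (means, correlation) = summarize()
    println(s"means=${means.mkString(", ")} correlation=$correlation")
  }
}
```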


Why choose MLlib?

● One of the best documented machine learning libraries available for the JVM

● Simple API; constructs are the same across different algorithms

● Well integrated with other Spark-components


Demo: Spark MLlib


Spark Streaming


Spark Streaming Overview

● Built around the concept of DStreams, or discretized streams

● Long-running Spark application

● Micro-batch architecture

● Supports Flume, Kafka, Twitter, Amazon Kinesis,

Socket, File…


DStreams

● A sequence of RDDs

● Stateless transformations

● Stateful transformations

● Checkpointing
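The sketch below illustrates the "sequence of RDDs" idea with a self-contained micro-batch job. As an assumption for the sake of a runnable example, it feeds the stream from an in-memory queue (`queueStream`) instead of Kafka or Flume, and it sums the per-batch word counts on the driver rather than using a stateful transformation with checkpointing. Names are hypothetical; Spark 1.3 or later is assumed.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  // Each inner Seq becomes one micro-batch; returns the word counts over all batches.
  def run(batches: Seq[Seq[String]]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // A DStream is a sequence of RDDs: here one RDD is taken from the queue per batch.
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(batch => queue += ssc.sparkContext.parallelize(batch))

    val totals = mutable.Map[String, Int]().withDefaultValue(0)
    ssc.queueStream(queue)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // stateless: counts within a single batch only
      .foreachRDD { rdd =>
        val counts = rdd.collect()
        totals.synchronized {
          counts.foreach { case (word, n) => totals(word) += n }
        }
      }

    ssc.start()
    // Leave enough time for every queued batch to be processed, then shut down.
    ssc.awaitTerminationOrTimeout(1000L * (batches.size + 2))
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    totals.synchronized { totals.toMap }
  }

  def main(args: Array[String]): Unit =
    println(run(Seq(Seq("a b"), Seq("b c"), Seq("c c"))))
}
```

A production job would normally replace the queue with a receiver for one of the supported sources and let Spark manage state via checkpointing.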


Spark Streaming Use Cases

● ETL and enrichment of streaming data on ingestion

● Lambda Architecture

● Operational dashboards


Demo: Spark Streaming


Spark on Amazon EC2


Apache Spark runs easily on Amazon EC2

Apache Spark comes with a script to launch Spark clusters on Amazon EC2, so there is no need to invest in a cluster of servers.

Furthermore, it supports multiple Amazon services:

● Spark can read files from Amazon S3

● Spark Streaming can easily be integrated with Amazon

Kinesis


Conclusion


Why choose Apache Spark?

● Modern integrated full-stack Big Data framework

● Suitable for both batch and (near) real time applications

● Well supported by a very large community

● The Big Data landscape seems to be shifting towards Apache Spark


Questions?