Boosting Big Data with Apache Spark

Mathias Lavaert, April 2015

Data Science Company

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be


About InfoFarm


Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.


Java

PHP

E-Commerce

Mobile

Web

Development


About me

Mathias Lavaert

Big Data Developer at InfoFarm since May 2014

Proud citizen of West-Flanders

Outdoor enthusiast


Agenda

• What is Apache Spark?

• An in-depth overview

– Spark Core and Resilient Distributed Datasets

– Unified access to structured data with Spark SQL

– Machine Learning with Spark MLLib

– Scalable streaming applications with Spark Streaming

• Q&A

• Wrap-up & lunch


What is Apache Spark?


“Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”


History

• Created by Matei Zaharia at UC Berkeley in 2009

• Based on the 2007 Microsoft Dryad paper

• Donated to the Apache Software Foundation in 2013

• 465 contributors in 2014, making it the most active Apache project

• Currently supported by Databricks, a company founded by the creators of Apache Spark


Target users

● Data Scientists

○ Data exploration and data modelling using interactive shells

○ Machine Learning

○ Ad hoc analysis to answer business questions or discover new insights

● Engineers

○ Fault-tolerant production data applications

○ ‘Productizing’ the work of the data scientist

○ Integration with business applications


Where to situate Apache Spark?


Differences with MapReduce

• Faster, by minimizing I/O and keeping data in memory as much as possible

• Unified libraries

• Huge community effort and a very fast development pace

• Ships with higher-level tools included


Daytona GraySort Contest


Differences with Hive, Pig, others...

• One integrated framework that suits a wide range of problems

• No need for a workflow application like Oozie

• Only one language/framework to learn


Explosion of Specialized Systems


Architecture


Advantages of unified libraries

Advancements in higher-level libraries are pushed down into core and vice versa

● Spark Core

○ Highly optimized, low-overhead, network-saturating shuffle

● Spark Streaming

○ Garbage collection, memory management, cleanup improvements

● Spark GraphX

○ IndexedRDD for random access within a partition vs scanning the entire partition

● Spark MLLib

○ Statistics (correlations, sampling, heuristics)


Supported languages


Difference between Java and Scala


Cluster Resource Managers

● Spark Standalone

○ Suitable for a lot of production workloads

○ Only suitable for Spark workloads

● YARN

○ Allows hierarchies of resources

○ Kerberos integration

○ Multiple workloads from different execution frameworks

■ Hive, Pig, Spark, MapReduce, Cascading, etc…

● Mesos

○ Similar to YARN, but allows elastic allocation

○ Coarse-grained

■ A single, long-running Mesos task runs Spark mini-tasks

○ Fine-grained

■ New Mesos task for each Spark task

■ Higher overhead, not good for long-running Spark jobs (Streaming)


Storage Layers for Spark

Spark can create distributed datasets from:

● Any file stored in the Hadoop distributed filesystem (HDFS)

● Any storage system supported by the Hadoop APIs

○ Local filesystem

○ S3

○ Cassandra

○ Hive

○ HBase

Note that Apache Spark doesn’t require Hadoop, but it has support for storage systems implementing the Hadoop APIs.
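As a rough sketch of creating a distributed dataset from a file, the snippet below reads a text file into an RDD with `sc.textFile`. The temporary local file, the object name `TextFileExample` and its helper methods are assumptions made only for this illustration; a URI pointing at HDFS or another Hadoop-supported store would be passed in the same way.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

object TextFileExample {
  // Counts the lines of a text file reachable through the Hadoop file APIs.
  def countLines(sc: SparkContext, uri: String): Long =
    sc.textFile(uri).count()

  // Writes three lines to a temporary local file and returns its file:// URI.
  def writeSampleFile(): String = {
    val path = Files.createTempFile("spark-demo", ".txt")
    Files.write(path, "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8))
    path.toUri.toString
  }

  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setMaster("local[2]").setAppName("TextFileExample")
    val sc = new SparkContext(conf)
    try {
      println(countLines(sc, writeSampleFile()))
    } finally {
      sc.stop()
    }
  }
}
```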


Short introduction to functional programming


What is functional programming?

A programming paradigm where the basic unit of abstraction is the function


Basic concepts

● Higher-order functions

○ Functions that take other functions as arguments or return a function as their result

● Pure functions

○ Purely functional expressions have no side effects

● Recursion

○ Iteration in functional languages is usually accomplished via recursion

● Immutable data structures


Small example with a functional language: Scala
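A minimal sketch of the basic concepts in plain Scala, with no Spark involved. The object and function names are illustrative assumptions and are not taken from the talk.

```scala
object FunctionalBasics {
  // Pure function: the result depends only on the argument, no side effects.
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Higher-order function: returns a new function as its result.
  def multiplyBy(factor: Int): Int => Int = (x: Int) => x * factor

  // Recursion instead of a loop with a mutable counter.
  def sum(numbers: List[Int]): Int = numbers match {
    case Nil          => 0
    case head :: tail => head + sum(tail)
  }

  def main(args: Array[String]): Unit = {
    // Immutable data structure: filter and map return new lists.
    val numbers = List(1, 2, 3, 4, 5)
    val evenSquares = numbers.filter(_ % 2 == 0).map(square)

    println(evenSquares)            // List(4, 16)
    println(applyTwice(square, 3))  // 81
    println(multiplyBy(3)(7))       // 21
    println(sum(numbers))           // 15
    println(numbers)                // List(1, 2, 3, 4, 5), unchanged
  }
}
```

The same `filter` and `map` style carries over to RDDs, which is why Spark code in Scala reads much like code on local collections.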


Introduction to Spark concepts


Resilient Distributed Datasets (RDDs)

● Core Spark abstraction

● Immutable distributed collection of objects

● Split into multiple partitions

● May be computed on different nodes of the cluster

● Can contain any type of Scala, Java or Python object, including user-defined classes

“Distributed Scala collections”


Driver and context

● Driver

○ Shell

○ Standalone program

● Spark Context represents a connection to a computing cluster


RDD Operations

● Transformations

○ map

○ filter

○ flatMap

○ sample

○ groupByKey

○ reduceByKey

○ union

○ join

○ sort

● Actions

○ count

○ collect

○ reduce

○ lookup

○ save

● Transformations are lazy

● Actions force the computation of transformations
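The difference between lazy transformations and actions can be sketched as follows. This is a small illustration under assumptions: the object name `LazyRddExample`, the sample sentences and the word-length threshold are all made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddExample {
  // Builds a chain of lazy transformations and triggers it with one action.
  def countLongWords(sc: SparkContext, lines: Seq[String]): Long = {
    val rdd = sc.parallelize(lines)            // distribute a local collection
    val words = rdd.flatMap(_.split(" "))      // transformation: nothing runs yet
    val longWords = words.filter(_.length > 4) // transformation: still lazy
    longWords.count()                          // action: forces the computation
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("LazyRddExample")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq("spark keeps data in memory", "transformations are lazy")
      println(countLongWords(sc, lines))
    } finally {
      sc.stop()
    }
  }
}
```

Until `count()` is called, Spark only records how `longWords` is derived from its parents; no data is read or processed.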


Narrow vs wide dependencies


Demo using only core operations


Specialized operations for specific types of RDDs


Specialized operations for Key/Value pairs

● reduceByKey

● groupByKey

● combineByKey

● mapValues

● flatMapValues

● keys

● sortByKey

● subtractByKey

● join

● rightOuterJoin

● leftOuterJoin

● cogroup
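Two of the key/value operations listed above, `reduceByKey` and `join`, can be sketched like this. The sales and category data, the object name `PairRddExample` and the method name are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair RDD implicits, needed before Spark 1.3

object PairRddExample {
  // Sums the amounts per key, then joins the totals with a lookup RDD.
  def totalsByName(sc: SparkContext): Map[String, Int] = {
    val sales = sc.parallelize(Seq((1, 10), (2, 5), (1, 7), (3, 2)))
    val names = sc.parallelize(Seq((1, "books"), (2, "games")))

    val totals = sales.reduceByKey(_ + _) // one total per key
    val joined = totals.join(names)       // inner join: key 3 has no name and drops out

    joined.map { case (_, (total, name)) => (name, total) }.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PairRddExample")
    val sc = new SparkContext(conf)
    try {
      println(totalsByName(sc))
    } finally {
      sc.stop()
    }
  }
}
```

These operations are only available on RDDs of two-element tuples, which is what makes them "specialized".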


Specialized operations for numeric RDDs

● count

● mean

● sum

● max

● min

● variance

● sampleVariance

● stdev

● sampleStdev
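A short sketch of the numeric operations, using `stats()` to obtain several of them in one pass over an RDD of doubles. The input values and the object name `NumericRddExample` are assumptions chosen for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // numeric RDD implicits, needed before Spark 1.3
import org.apache.spark.util.StatCounter

object NumericRddExample {
  // Computes count, mean, sum, min, max and variance in a single pass.
  def summarize(sc: SparkContext, values: Seq[Double]): StatCounter =
    sc.parallelize(values).stats()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NumericRddExample")
    val sc = new SparkContext(conf)
    try {
      val s = summarize(sc, Seq(1.0, 2.0, 3.0, 4.0))
      println(s"count=${s.count} mean=${s.mean} sum=${s.sum}")
      println(s"min=${s.min} max=${s.max}")
      // variance divides by n, sampleVariance divides by n - 1.
      println(s"variance=${s.variance} sampleVariance=${s.sampleVariance}")
    } finally {
      sc.stop()
    }
  }
}
```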


And many more...

● HadoopRDD

● FilteredRDD

● MappedRDD

● PairRDD

● ShuffledRDD

● UnionRDD

● DoubleRDD

● JdbcRDD

● JsonRDD

● SchemaRDD

● VertexRDD

● EdgeRDD

● CassandraRDD

● GeoRDD

● EsSpark (Elasticsearch)


Spark SQL


Spark SQL Overview

● Newest component of Spark

● Tightly integrated to work with structured data

○ Tables with rows and columns

● Transform RDDs using SQL

● Data source integration: Hive, Parquet, JSON and more…

● Optimizes execution plan


Differences with Spark Core

● Spark + RDDs

○ Functional transformations on collections of objects

● SQL + SchemaRDDs

○ Declarative transformations on collections of tuples


Getting started with Spark SQL

● Create an instance of SQLContext or HiveContext

○ Entry point for all SQL functionality

○ Wraps/extends existing Spark Context (Decorator Pattern)

● If you’re using the shell, a SQLContext has already been created for you

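import org.apache.spark.SparkContext

import org.apache.spark.sql.SQLContext

val sparkContext = new SparkContext("local[4]", "SQL") // local master with 4 threads, application name "SQL"

val sqlContext = new SQLContext(sparkContext) // wraps the existing SparkContext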


Language Integrated UDFs

● Ability to write custom SQL functions in any of the languages supported by Spark

● Another example of how Spark simplifies the big data stack
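A hedged sketch of a language-integrated UDF, assuming the Spark 1.3 API, in which SchemaRDD was renamed to DataFrame. The `Person` class, the table name `people` and the function name `toUpper` are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Int)

object UdfExample {
  // Registers a Scala function as a SQL function and uses it in a query.
  def upperCaseAdults(sc: SparkContext): Seq[String] = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    val people = sc.parallelize(Seq(Person("ann", 34), Person("bob", 12))).toDF()
    people.registerTempTable("people")

    // An ordinary Scala function becomes callable from SQL under the given name.
    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)

    sqlContext
      .sql("SELECT toUpper(name) FROM people WHERE age >= 18")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("UdfExample")
    val sc = new SparkContext(conf)
    try {
      upperCaseAdults(sc).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

No separate UDF project or deployment step is involved; the function lives next to the query that uses it.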


Parquet compatibility

Native support for reading data stored in Parquet:

● Columnar storage avoids reading unneeded data

● SchemaRDDs can be written to Parquet while preserving the schema

● Convert other slower formats like JSON to Parquet for repeated querying.
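Writing to Parquet and reading it back can be sketched as below, again assuming the Spark 1.3 API (`saveAsParquetFile` and `parquetFile`; later releases expose the same functionality through `write` and `read`). The `Event` class, the table name and the temporary output directory are assumptions made for the example.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type used only for this illustration.
case class Event(id: Int, kind: String)

object ParquetExample {
  // Saves a small DataFrame as Parquet, reloads it and queries one column.
  def kindOfEvent(sc: SparkContext, id: Int): Seq[String] = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The target directory must not exist yet, so pick a fresh path.
    val dir = Files.createTempDirectory("parquet-demo").resolve("events.parquet").toString

    val events = sc.parallelize(Seq(Event(1, "click"), Event(2, "view"))).toDF()
    events.saveAsParquetFile(dir) // the schema is stored together with the data

    val loaded = sqlContext.parquetFile(dir) // column names and types are restored
    loaded.registerTempTable("events")

    sqlContext
      .sql(s"SELECT kind FROM events WHERE id = $id")
      .collect()
      .map(_.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ParquetExample")
    val sc = new SparkContext(conf)
    try {
      kindOfEvent(sc, 2).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```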


Demo: Spark SQL


Spark MLLib


Machine Learning Algorithms

● Supervised

○ Prediction: Train a model with existing data + label, predict label for new data

■ Classification (categorical)

■ Regression (continuous numeric)

○ Recommendation: recommend to similar users

■ User -> user, item -> item, user -> item similarity

● Unsupervised

○ Clustering: Find natural clusters in data based on similarities


Algorithms provided by Spark

● Classification and regression

○ Linear models (SVMs, logistic regression, linear regression)

○ Naive Bayes

○ Decision trees

○ Ensembles of trees (Random Forests and Gradient-Boosted trees)

○ Isotonic regression

● Recommendations

○ Alternating Least Squares (ALS)

○ FP-growth

● Clustering

○ K-Means

○ Gaussian mixture

○ Power Iteration clustering

○ Latent Dirichlet allocation

○ Streaming k-means

● Dimensionality reduction

○ Singular value decomposition (SVD)

○ Principal component analysis (PCA)


Tools provided by Spark

● Tools for basic statistics including

○ Summary statistics

○ Correlations

○ Sampling

○ Hypothesis testing

○ Random data generation

● Tools for feature extraction and transformation

○ Extracting features out of text

○ Uniform Vector format to store features

● Tools to build Machine Learning Pipelines using Spark SQL


Why choose MLLib?

● One of the best documented machine learning libraries available for the JVM

● Simple API, constructs are the same for different algorithms


● Well integrated with other Spark-components
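The train-then-predict pattern of the RDD-based MLlib API can be sketched with k-means clustering. The four sample points, the choice of two clusters and the object name `KMeansExample` are assumptions made for the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  // Trains a k-means model on two well-separated groups of points.
  def trainModel(sc: SparkContext): KMeansModel = {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    )).cache() // k-means iterates over the data, so keep it in memory

    KMeans.train(points, 2, 20) // k = 2 clusters, at most 20 iterations
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KMeansExample")
    val sc = new SparkContext(conf)
    try {
      val model = trainModel(sc)
      model.clusterCenters.foreach(println)
      // predict returns the index of the closest cluster centre.
      println(model.predict(Vectors.dense(0.05, 0.05)))
      println(model.predict(Vectors.dense(9.05, 9.05)))
    } finally {
      sc.stop()
    }
  }
}
```

The input is an ordinary RDD of feature vectors, so it can come straight out of the Spark Core or Spark SQL steps that prepared the data.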


Demo: Spark MLLib


Spark Streaming


Spark Streaming Overview

● Built around the concept of DStreams, or discretized streams

● Long-running Spark application

● Micro-batch architecture

● Supports Flume, Kafka, Twitter, Amazon Kinesis, Socket, File…


DStreams

● A sequence of RDDs

● Stateless transformations

● Stateful transformations

● Checkpointing
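A minimal word-count sketch of a DStream with stateless transformations, in the style of the well-known network word count example. The host `localhost`, port `9999` and the one-second batch interval are assumptions; something must be writing lines of text to that socket (for instance `nc -lk 9999`) for any counts to appear.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream implicits, needed before Spark 1.3

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one receives data, one processes it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

    // Each batch interval yields one RDD of the lines received in that second.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Stateless transformations: applied to every batch independently.
    val words = lines.flatMap(_.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

    counts.print() // output operation: prints the first elements of each batch

    ssc.start()            // nothing is processed before start()
    ssc.awaitTermination() // keep the long-running application alive
  }
}
```

The transformations are the same ones used on RDDs, because each batch of the DStream is an RDD.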


Spark Streaming Use Cases

● ETL and enrichment of streaming data on ingestion

● Lambda Architecture

● Operational dashboards


Demo: Spark Streaming


Spark on Amazon EC2


Apache Spark runs easily on Amazon EC2

Apache Spark comes with a script to launch Spark clusters on Amazon EC2.

So there is no need to invest in a cluster of servers...

Furthermore, it supports multiple Amazon components.

● Spark can read files from Amazon S3

● Spark Streaming can easily be integrated with Amazon Kinesis


Conclusion


Why choose Apache Spark?

● Modern integrated full-stack Big Data framework

● Suitable for both batch and (near) real time applications

● Well supported by a very large community

● The Big Data landscape seems to be shifting to Apache Spark


Questions?