Implementing the Lambda Architecture with Spring XD

  • View
    2.068

  • Download
    7

  • Category

    Software

Preview:

DESCRIPTION

Recorded at SpringOne2GX 2014. Speaker: Carlos Queiroz Big Data Track The lambda architecture has been proposed as a general purpose data system that aims to solve the problem of computing arbitrary functions on an arbitrary dataset in (near) real-time. In this talk we introduce the lambda architecture and show how it can be implemented with SpringXD, GemFireXD and Hadoop (HDFS and MapReduce) as the foundation of the architecture implementation. To validate the architecture we introduce a CDR (Call Detail Record) mining application as use case of the lambda architecture. We finalise by showing a demo of the CDR (Call Detail Record) mining application.

Citation preview

© 2014 SpringOne 2GX. All rights reserved. Do not distribute without permission.

Implementing the Lambda Architecture with Spring XD

By Carlos Queiroz

Me

Advanced Field Engineer

2

Carlos Queiroz

Agenda

!

• Introduction to the Lambda Architecture • Applying Lambda architecture to a business case • Implementation details

3

What is Lambda Architecture?

Implementation of a set of desired properties on general purpose big data systems. !

Generic architecture addressing common requirements in big data applications. !

Set of design patterns for dealing with historical and operational data.

4

Approach is new, ideas not so…

5

Traditional / Relational

Data Sources

The Big Data Ecosystem

Stre

amin

g D

ata

Trad

ition

al

War

ehou

se

Analytics on Data at Rest

Data Warehouse

Analytics on Structured Data

Analytics onData in Motion

MapReduce like modelsNon-Traditional /Non-RelationalData Sources

Non-Traditional / Non-Relational

Data Feeds

Traditional / Relational

Data SourcesIn

tern

et-

Scal

e D

ata

Sets

Stream computing

Hadoop

Why Lambda Architecture?

To build a data system that can answer questions by running functions that take at the entire dataset as input. A general purpose data system

6

Desired Properties of such systems

Fault-tolerance

Generic

Scale linearly, horizontally.

Extensible

Be able to achieve low latency updates when necessary.

Be able to “ask” arbitrary questions to the system

7

How to support those properties?

Decompose the problem into pieces

8

Speed Layer

Batch Layer

Serving Layer

Batch Layer

9

Master dataset Batch Layer

Batch view

Batch view

Batch view

runBatchLayer() { while(true) recomputeFunctions()}

The Master dataset

10

Master dataset

Sourceof

Truth

Recompute

Writeonce

Protect

Readmany

Master dataset requirements

11

Operation Requisite Comments

Writes Efficient appends of new data

Basically add new pieces of data. Easy to append new data

Writes Scalable storage Need to handle possibly, petabytes of data

Reads Support for parallel processing

Functions usually work on the entire dataset. Need to support handling large amounts of data

Reads Vertically partition data Not necessary to look all the data all the time. Some functions may need to look at only relevant data (e.g. 1 week of calls)

Writes/Reads Costs for processing. Flexible storage

Storage costs money (a lot). Need flexibility on how to store and compress data.

The Master dataset

12

HDFS - Hadoop Distributed File System

Serving Layer

Makes the batch views “queryable”

13

Batch view

Batch view

Batch updates and random reads

Distrib

uted data

base

Desired properties on the Serving layer

14

Batch view

Batch view

High ThroughputLow Latency

Serving layer database requirements

15

Batch writable

Scalable

Random reads

Fault-tolerant

Batch and serving layers

16

Master dataset Batch Layer

Batch view

Batch view

Batch view

analytical functions,immutable storage

Serving Layer

pre-computed analytical functions

(Views)

Only property missing - low latency updates

Speed layer

Allows arbitrary functions computed on arbitrary data on (near) real-time.

17

Feed

Feed

Feed

Feed

Feed

Feed

Speed layer

18

RealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime View

Narrow but more up to date view

Depending of functions complexity incremental computation approach is recommended

Incremental computation

realtime view = function(new data, realtime view)

19

Eventual Accuracy

!

Some computations are harder to compute !

For such cases approximations are used. Results are approximate to the correct answer !

Sophisticated queries such as realtime machine learning are usually done with eventual accuracy. Too complex to be computed exactly.

20

Approximation & Randomisation

!

!

Approximation find an answer correct within some factor

(answer that is within 10% of correct result)

!

Randomisation allow a small probability of failure

(1 in 10,000 give the wrong answer)

!

!

21

Synopses structures

• Sampling • Sketches • Histograms • Wavelets

!

Library implementations • algebird (https://github.com/twitter/algebird) • samoa (https://github.com/yahoo/samoa/wiki)

22

http://charuaggarwal.net/synopsis.pdf

Stream processing (real-time function)

Run the realtime functions to update the realtime views

23

Data Ingestion Stream Processing (realtime function)

One-at-a-time

Divide your processing into worker processes, and put queues between the worker processes

24Queues and workers paradigm

Generalised one-at-a-time approach

• Works on a higher level • Stream computation defined as a graph (usually).

• Storm, InfoSphere Streams models • Filters and pipes

• Spring XD • At least once in case of failures.

25

cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less

Micro-batching

• Small batches of events processed one at a time !

• Exactly one processing !

• Implementations: • Spark Streaming

26

batch #1

batch #2Data Ingestion

Lambda Architecture

27

Ad-hoc queries

analytical functions

analytical functions,immutable storage

raw data

Data Ingestion

Speed Layer

Batch Layer Serving Layer

raw data

raw data

raw data

raw data

pre-computed analytical functions

Principles of Lambda Architecture

Store data in it’s rawest form

Immutability and perpetuity !

Re-computation !

Query = function(all data)

28

Implementing the Lambda Architecture

The ACM DEBS 2014 Grand Challenge1

29

To demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high volume sensor data

1 http://www.cse.iitb.ac.in/debs2014/

ACM DEBS 2014 Challenge Application

Scenario !

Analysis of energy consumption measurement

30

Short-term load forecasting makes load forecasts based on current load and what was learned over historical data

Load statistics for real-time demand management finds outliers based on the energy consumption

plug

plug

house_id

dev_id

plug_id

Data model

31

Data source

Field Comments

ID UNIQUE IDENTIFIER

TIMESTAMP TIMESTAMP OF MEASUREMENT

VALUE MEASUREMENT

PROPERTY TYPE: 0 - WORK, 1 - LOAD

PLUG_ID UNIQUE IDENTIFIER OF A PLUG

HOUSEHOLD_ID UNIQUE IDENTIFIER OF A HOUSEHOLD

HOUSE_ID UNIQUE IDENTIFIER OF A HOUSE

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

PREDICTED_LOAD PREDICTED LOAD

PL - House

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Outliers

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

PREDICTED_LOAD PREDICTED LOAD

PL - Plug

Analytical models

32

L(si+2) = (avgLoad(si) +median(avgLoad(sj)))/2

Load prediction per house, per plug

OutliersFor each house calculate the percentage of plugs which have a median load during the last hour greater than the median load of all plugs (in all households of all houses) during the last hour

sj = s(i+2–n⇤k)

It is based on the average load of the current window and the median of the average loads of windows covering the same time of all past days

How others did

33

Storm-based implementation

How others did

34

Oracle DB + Oracle Event Processing (OEP)

How others did

35

http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc

SEEP: Real-Time Stateful Big Data Processing

Implementation details

Fork from https://github.com/pivotalsoftware/gfxd-demo

36

Thanks to Jens Deppe and William Markito

Our approach - ACM DEBS 2014 system architecture

37

RabbitMQ Spring XD

Hadoop

GemFire XD Web App(Spring boot)

Batch layer

Serving layer

Speed layerSensorEventSensor

EventSensorEventSensor

EventSensorEvent

Data ingestion

38

Sensor Events

RabbitMQ

publishes

Stream App

Spring XD

subscribe

s

S

S

S

SS

SS

S

Speed layer

39

Application flow

Removeduplicates

Filter

Fix missingvaluestransform

Store toGFXD

Sink

Findoutliers

Tap

Load Forecast

Tap

Loadpred. model

Processor

Store toGFXD

Sink

OutliersModelProcessor

not a flow

Store toGFXD

Sink

RT viewRT view

RT view

RT viewRT view

RT view

Load Forecast

Tap

Loadpred. model

Processor

Store toGFXD

Sink RT viewRT view

RT view

P

H

Stream actions

40

Duplicatesfiltering

enrichment(compute time slice)

computemodel Produce RT View

Spring XD streams

41

pumpin - sends all data to a master queue

sensoreventenricher - Consumes the data from the queue, filter and transform the data before store on HDFS

findoutliers - Taps from master_ds stream to compute outlier model

loadpred{h,p} - Taps from master_ds stream to compute load prediction for house and plug (2 streams)

5 streams

Batch Layer

42

Batch job that starts from Spring XD Uses cron to run every X hour/min/day?

Load pred. model

Job

Hadoop

Batch viewBatch view

Batch view

Hadoop based system

Store an immutable and constantly growing master dataset (of all datasets) !

Compute arbitrary functions (models) on the existing datasets. !

Essentially, runs the MR models.

Spring XD jobs

• A unique job to run the MR model to compute the historical aspect of the models.

• MR job runs every “day/hour/min” • Job is completely independent from the other streams

43

1 job

Load raw_sensortable

computemodel

updateavg, outliers

table

Serving layer (RT and Batch views)

44

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

PREDICTED_LOAD PREDICTED LOAD

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Gemfire XD

Get updates from MR and SP models

Holds both batch and real-time views

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

PREDICTED_LOAD PREDICTED LOAD

Web UI

45

Web App(Spring boot)

view view

view

Why Gemfire XD?

‣ Distributed database !

‣ Tightly integrated with Pivotal Hadoop !

‣ SQL support !

‣ Fault-tolerant !

‣ In-memory (fast access)

46

GemFire XD

Hadoop

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

Why Pivotal HD (PHD)

47

‣ Hadoop based !

‣ Tightly integrated with Gemfire XD

GemFire XD

Hadoop

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

column_name data_type

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

Why Spring XD?

‣ Data Ingestion !

‣ Real-time Analytics !

‣ Workflow Orchestration !

‣ Integration

48

Full Open source implementation

49

HBase??

Hadoop

SpringXD

Is Lambda Architecture perfect?

• Suitable for specific use cases • Doesn't contemplate reference data access. • Not always possible to use same model for Real-time and Batch • Other ideas to improve the lambda architecture

• Multiple batch layers • incremental batch layers

50

Source code

51

https://bitbucket.org/caxqueiroz/debs2014-challenge

Thank you!!

52

Recommended