Implementing the Lambda Architecture with Spring XD

By Carlos Queiroz

Advanced Field Engineer

Carlos Queiroz

Agenda

• Introduction to the Lambda Architecture • Applying Lambda architecture to a business case • Implementation details

What is Lambda Architecture?

Implementation of a set of desired properties on general purpose big data systems. !

Generic architecture addressing common requirements in big data applications. !

Set of design patterns for dealing with historical and operational data.

Approach is new, ideas not so…

Traditional / Relational

Data Sources

The Big Data Ecosystem

Analytics on Data at Rest

Data Warehouse

Analytics on Structured Data

Analytics onData in Motion

MapReduce like modelsNon-Traditional /Non-RelationalData Sources

Non-Traditional / Non-Relational

Data Feeds

Traditional / Relational

Data SourcesIn

Stream computing

Hadoop

Why Lambda Architecture?

To build a data system that can answer questions by running functions that take at the entire dataset as input. A general purpose data system

Desired Properties of such systems

Fault-tolerance

Generic

Scale linearly, horizontally.

Extensible

Be able to achieve low latency updates when necessary.

Be able to “ask” arbitrary questions to the system

How to support those properties?

Decompose the problem into pieces

Speed Layer

Batch Layer

Serving Layer

Batch Layer

Master dataset Batch Layer

Batch view

runBatchLayer() { while(true) recomputeFunctions()}

The Master dataset

Master dataset

Sourceof

Recompute

Writeonce

Protect

Readmany

Master dataset requirements

Operation Requisite Comments

Writes Efficient appends of new data

Basically add new pieces of data. Easy to append new data

Writes Scalable storage Need to handle possibly, petabytes of data

Reads Support for parallel processing

Functions usually work on the entire dataset. Need to support handling large amounts of data

Reads Vertically partition data Not necessary to look all the data all the time. Some functions may need to look at only relevant data (e.g. 1 week of calls)

Writes/Reads Costs for processing. Flexible storage

Storage costs money (a lot). Need flexibility on how to store and compress data.

The Master dataset

HDFS - Hadoop Distributed File System

Serving Layer

Makes the batch views “queryable”

Batch view

Batch updates and random reads

Distrib

uted data

Desired properties on the Serving layer

Batch view

High ThroughputLow Latency

Serving layer database requirements

Batch writable

Scalable

Random reads

Fault-tolerant

Batch and serving layers

Master dataset Batch Layer

Batch view

analytical functions,immutable storage

Serving Layer

pre-computed analytical functions

(Views)

Only property missing - low latency updates

Speed layer

Allows arbitrary functions computed on arbitrary data on (near) real-time.

Speed layer

RealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime FunctionRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime ViewRealTime View

Narrow but more up to date view

Depending of functions complexity incremental computation approach is recommended

Incremental computation

realtime view = function(new data, realtime view)

Eventual Accuracy

Some computations are harder to compute !

For such cases approximations are used. Results are approximate to the correct answer !

Sophisticated queries such as realtime machine learning are usually done with eventual accuracy. Too complex to be computed exactly.

Approximation & Randomisation

Approximation find an answer correct within some factor

(answer that is within 10% of correct result)

Randomisation allow a small probability of failure

(1 in 10,000 give the wrong answer)

Synopses structures

• Sampling • Sketches • Histograms • Wavelets

Library implementations • algebird (https://github.com/twitter/algebird) • samoa (https://github.com/yahoo/samoa/wiki)

http://charuaggarwal.net/synopsis.pdf

Stream processing (real-time function)

Run the realtime functions to update the realtime views

Data Ingestion Stream Processing (realtime function)

One-at-a-time

Divide your processing into worker processes, and put queues between the worker processes

24Queues and workers paradigm

Generalised one-at-a-time approach

• Works on a higher level • Stream computation defined as a graph (usually).

• Storm, InfoSphere Streams models • Filters and pipes

• Spring XD • At least once in case of failures.

cut -d" " -f1 < access.log | sort | uniq -c | sort -rn | less

Micro-batching

• Small batches of events processed one at a time !

• Exactly one processing !

• Implementations: • Spark Streaming

batch #1

batch #2Data Ingestion

Lambda Architecture

Ad-hoc queries

analytical functions

analytical functions,immutable storage

raw data

Data Ingestion

Speed Layer

Batch Layer Serving Layer

raw data

pre-computed analytical functions

Principles of Lambda Architecture

Store data in it’s rawest form

Immutability and perpetuity !

Re-computation !

Query = function(all data)

Implementing the Lambda Architecture

The ACM DEBS 2014 Grand Challenge1

To demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high volume sensor data

1 http://www.cse.iitb.ac.in/debs2014/

ACM DEBS 2014 Challenge Application

Scenario !

Analysis of energy consumption measurement

Short-term load forecasting makes load forecasts based on current load and what was learned over historical data

Load statistics for real-time demand management finds outliers based on the energy consumption

house_id

dev_id

plug_id

Data model

Data source

Field Comments

ID UNIQUE IDENTIFIER

TIMESTAMP TIMESTAMP OF MEASUREMENT

VALUE MEASUREMENT

PROPERTY TYPE: 0 - WORK, 1 - LOAD

PLUG_ID UNIQUE IDENTIFIER OF A PLUG

HOUSEHOLD_ID UNIQUE IDENTIFIER OF A HOUSEHOLD

HOUSE_ID UNIQUE IDENTIFIER OF A HOUSE

Field CommentsTS TIMESTAMP OF STARTING TIME PREDICTION

HOUSE_ID HOUSE ID

PREDICTED_LOAD PREDICTED LOAD

PL - House

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Outliers

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

PL - Plug

Analytical models

L(si+2) = (avgLoad(si) +median(avgLoad(sj)))/2

Load prediction per house, per plug

OutliersFor each house calculate the percentage of plugs which have a median load during the last hour greater than the median load of all plugs (in all households of all houses) during the last hour

sj = s(i+2–n⇤k)

It is based on the average load of the current window and the median of the average loads of windows covering the same time of all past days

How others did

Storm-based implementation

How others did

Oracle DB + Oracle Event Processing (OEP)

How others did

http://lsds.doc.ic.ac.uk/projects/seep/debs14-gc

SEEP: Real-Time Stateful Big Data Processing

Implementation details

Fork from https://github.com/pivotalsoftware/gfxd-demo

Thanks to Jens Deppe and William Markito

Our approach - ACM DEBS 2014 system architecture

RabbitMQ Spring XD

Hadoop

GemFire XD Web App(Spring boot)

Batch layer

Serving layer

Speed layerSensorEventSensor

EventSensorEventSensor

EventSensorEvent

Data ingestion

Sensor Events

RabbitMQ

publishes

Stream App

Spring XD

subscribe

Speed layer

Application flow

Removeduplicates

Filter

Fix missingvaluestransform

Store toGFXD

Findoutliers

Load Forecast

Loadpred. model

Processor

Store toGFXD

OutliersModelProcessor

not a flow

Store toGFXD

RT viewRT view

RT view

RT viewRT view

RT view

Load Forecast

Loadpred. model

Processor

Store toGFXD

Sink RT viewRT view

RT view

Stream actions

Duplicatesfiltering

enrichment(compute time slice)

computemodel Produce RT View

Spring XD streams

pumpin - sends all data to a master queue

sensoreventenricher - Consumes the data from the queue, filter and transform the data before store on HDFS

findoutliers - Taps from master_ds stream to compute outlier model

loadpred{h,p} - Taps from master_ds stream to compute load prediction for house and plug (2 streams)

5 streams

Batch Layer

Batch job that starts from Spring XD Uses cron to run every X hour/min/day?

Load pred. model

Hadoop

Batch viewBatch view

Batch view

Hadoop based system

Store an immutable and constantly growing master dataset (of all datasets) !

Compute arbitrary functions (models) on the existing datasets. !

Essentially, runs the MR models.

Spring XD jobs

• A unique job to run the MR model to compute the historical aspect of the models.

• MR job runs every “day/hour/min” • Job is completely independent from the other streams

Load raw_sensortable

computemodel

updateavg, outliers

Serving layer (RT and Batch views)

HOUSE_ID HOUSE ID

Field CommentsTS_START TIMESTAMP OF START TIME WINDOW

TS_STOP TIMESTAMP OF STO PTIME WINDOW

HOUSE_ID HOUSE ID

PERCENTAGE % PLUGS LOAD HIGHER THAN NORMAL

Gemfire XD

Get updates from MR and SP models

Holds both batch and real-time views

HOUSE_ID HOUSE ID

HOUSEHOLD_ID

PLUG_ID

Web UI

Web App(Spring boot)

view view

Why Gemfire XD?

‣ Distributed database !

‣ Tightly integrated with Pivotal Hadoop !

‣ SQL support !

‣ Fault-tolerant !

‣ In-memory (fast access)

GemFire XD

Hadoop

column_name data_type

Model: Datahas_many_and_belongs_to_many :teams

Table1

Why Pivotal HD (PHD)

‣ Hadoop based !

‣ Tightly integrated with Gemfire XD

GemFire XD

Hadoop

Table1

Why Spring XD?

‣ Data Ingestion !

‣ Real-time Analytics !

‣ Workflow Orchestration !

‣ Integration

Full Open source implementation

HBase??

Hadoop

SpringXD

Is Lambda Architecture perfect?

• Suitable for specific use cases • Doesn't contemplate reference data access. • Not always possible to use same model for Real-time and Batch • Other ideas to improve the lambda architecture

• Multiple batch layers • incremental batch layers

Source code

https://bitbucket.org/caxqueiroz/debs2014-challenge

Thank you!!

Implementing the Lambda Architecture with Spring XD

Software

BLACK XD〈エックス・ディーアウトライン〉 OUTLINE - ARAIサイズ XD アウトライン・赤 XD アウトライン・青 XD アウトライン・黒（54） 4530935514731

christmast xd

SyntheticDNAfragmentsbearingICR … · 2018. 6. 29. · I,IV 1st lambda-MA-5S4 lambda-MA-3A2 2nd lambda-MA-5S1 lambda-MA-3A3 II,V 1st lambda-MA-5S5 lambda-MA-3A7 2nd lambda-MA-5S6

MBZDBCX-5 : XD : YazDa >4ññ-f7 Photo : XD L Package / Body … · XD) PROACTIVE, 25T PROACTIVES XD PROACTIVEs25S L L Packages XD L Package) 99 cx-e SKYACTIV-G 2500 DOHC 2WD (6EC-AT)

Implementing lambda expressions in Java - …wiki.jvmlangsummit.com/images/7/7b/Goetz-jvmls-lambda.pdf · Implementing lambda expressions in Java Brian Goetz Java Language Architect

XD CLOTHES

Anime XD!!!

Implementing a Dependently Typed Lambda Calculuswouters/Talks/ImplDepTypedLambdaCalc-Sheffield.pdfCurry-Howard Isomorphism • Dependent types provide a mathematical framework for

Mary Had a Little Lambda: Implementing a Minimal … Had a Little Lambda: Implementing a Minimal Lisp for ... and the most important features of this dialect. ... of his key discoveries

COMPACT HI-FI SYSTEM XD-SERIESmanual.kenwood.com/files/B60-5014-00.pdf · 2010. 9. 17. · COMPACT HI-FI SYSTEM XD-SERIES XD-855 / XD-855E / XD855 XD-755 / XD-755E / XD755 XD-655

JST-XD-1910 XD series JOYSTICKS

xD SPECIFICATIONS xD SPECIFICATIONS COLOR CHOICES …

COMPACT HI-FI SYSTEM XD-SERIES - KENWOODmanual.kenwood.com/files/B60-4573-00.pdf · COMPACT HI-FI SYSTEM INSTRUCTION MANUAL KENWOOD CORPORATION XD-SERIES XD-A83 XD-853 XD-803 This

Implementing a Highly Scalable Stock Prediction System with R, Apache Geode and Spring XD

COMPONENT SYSTEM/COMPACT HI-FI SYSTEM XD SERIESmanual.kenwood.com/files/B60-3768-00.pdf · XD SERIES XD-751/XD-701 XD-771S XD-551/XD-501 XD-571S. XD-SERIES (En) 2 Preparation section

grafeno XD

Costing XD

Implementing the speed layer in a lambda architecture - IT4BIit4bi.univ-tours.fr/it4bi/medias/pdfs/2016_Master_Thesis/... · Implementing the speed layer in a lambda architecture

Introduction to lambda expression & lambda calculus

Spring XD Guide · Spring XD 1.0.0 Spring XD Guide iv Single quotes, Double quotes, Escaping ..... 58 11. Streams