The three generations of Big Data processing


Rubén Casado

[email protected]

Agenda

1. Big Data processing
2. Batch processing
3. Real-time processing
4. Hybrid computation model
5. Conclusions

About me :-)

Academics

PhD in Software Engineering
MSc in Computer Science
BSc in Computer Science

Work experience

About Treelogic

Treelogic is an R&D-intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life

TREELOGIC – Distributor and Sales

International Projects

National Projects

Regional Projects

R&D Manag.

System

Internal Projects

Research Lines

Computer Vision

Big Data

Terahertz technology

Data science

Social Media Analysis

Semantics

Security & Safety

Justice

Health

Transport

Financial services

ICT tailored solutions

Solutions

R&D

7 ongoing FP7 projects

ICT, SEC, OCEAN

Coordinating 5 of them

3 ongoing Eurostars projects

Coordinating all of them

Research & Innovation

7 years' experience in R&D projects

www.datadopter.com



What is Big Data?

A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques

What is Big Data like?

Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization

- Gartner IT Glossary -

3 problems: Volume, Variety, Velocity

3 solutions: Batch processing, NoSQL, Real-time processing

Batch processing (Volume)

• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault-tolerant
• High latency

Real-time processing (Velocity)

• Low latency
• Continuous unbounded streams of data
• Distributed
• Parallel
• Fault-tolerant

Hybrid computation model (Volume + Velocity)

• Low latency
• Massive data + Streaming data
• Scalable
• Combine batch and real-time results

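Hybrid computation model

[Figure: all data goes through batch processing to produce batch results, new data goes through stream processing to produce stream results, and the two are combined into the final results.]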
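The combination step of the hybrid model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API of any framework: the names `batch_view`, `stream_view` and `combine_views` are hypothetical, and each view is assumed to hold per-station (sum, count) pairs.

```python
# Hypothetical per-station partial aggregates: station_id -> (sum, count).
# batch_view covers all historical data; stream_view covers only the new
# data that arrived after the last batch run.
batch_view = {1: (40.0, 20), 2: (54.0, 20)}
stream_view = {1: (9.0, 3), 3: (7.0, 2)}


def combine_views(batch, stream):
    """Merge batch and stream partial aggregates into final averages."""
    final = {}
    for station in set(batch) | set(stream):
        b_sum, b_count = batch.get(station, (0.0, 0))
        s_sum, s_count = stream.get(station, (0.0, 0))
        final[station] = (b_sum + s_sum) / (b_count + s_count)
    return final


final_results = combine_views(batch_view, stream_view)
```

Keeping sums and counts rather than finished averages is a design choice of this sketch: two averages cannot be merged correctly without knowing how many values each one covers.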

The three generations

Inception: 2003

1st generation (2006): Batch processing. Large amounts of static data; scalable solution. Addresses Volume.

2nd generation (2010): Real-time processing. Computing streaming data; low latency. Addresses Velocity.

3rd generation (2014): Hybrid computation. Lambda Architecture. Addresses Volume + Velocity.

10 years of Big Data processing technologies

Processing paradigms: Batch, Real-Time, Hybrid

2003: The Google File System
2004: MapReduce: Simplified Data Processing on Large Clusters
2005: Doug Cutting starts developing Hadoop
2006: Yahoo! starts working on Hadoop
2008: Apache Hadoop is in production
2009: Facebook creates Hive; Yahoo! creates Pig
2010: Yahoo! creates S4; Cloudera presents Flume
2011: Nathan Marz creates Storm; LinkedIn presents Kafka
2012: Nathan Marz defines the Lambda Architecture
2013: Google publishes MillWheel: Fault-Tolerant Stream Processing at Internet Scale; LinkedIn presents Samza

Processing pipeline: Data acquisition → Data storage → Data analysis → Results

Air Quality case study

• Static stations and mobile sensors in Asturias sending streaming data
• Historical data of > 10 years
• Monitoring, trend identification, predictions


Batch processing technologies

Data acquisition: HDFS commands, Sqoop, Flume, Scribe
Data storage: HDFS, HBase
Data analysis: MapReduce, Hive, Pig, Cascading, Spark, Shark

HDFS commands (data acquisition, batch)

• Import to HDFS

hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>

hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/

Sqoop (data acquisition, batch)

• Tool designed for transferring data between HDFS/HBase and structured datastores
• Based on MapReduce
• Includes connectors for multiple databases
o MySQL
o PostgreSQL
o Oracle
o SQL Server
o DB2
o Generic JDBC connector
• Java API

Sqoop (data acquisition, batch)

1) Import data from database to HDFS

sqoop import-all-tables --connect jdbc:mysql://localhost/testDatabase --warehouse-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

2) Analyze data (Hadoop)

3) Export results to database

sqoop export --connect jdbc:mysql://localhost/testDatabase --table <table-name> --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

Flume (data acquisition, batch)

• Service for collecting, aggregating, and moving large amounts of log data
• Simple and flexible architecture based on streaming data flows
• Reliability, scalability, extensibility, manageability
• Supported log stream types
o Avro
o Syslog
o Netcat

Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom

Channels: Memory, JDBC, File

Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom

Flume (data acquisition, batch)

• Architecture
o Source: waits for events.
o Channel: stores the information until it is consumed by the sink.
o Sink: sends the information to another agent or system.
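The source, channel and sink roles can be illustrated with a small in-memory sketch. It is not Flume code: `MemoryChannel`, `run_source` and `run_sink` are hypothetical names, and a plain list stands in for HDFS.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink (FIFO)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None if the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: turns each incoming line into an event and puts it on the channel."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, destination):
    """Sink: drains the channel and delivers event bodies to the destination."""
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event["body"])


channel = MemoryChannel()
stored = []  # stands in for HDFS in this sketch
run_source(["1;7;8", "2;5;6"], channel)
run_sink(channel, stored)
```

The channel decouples the two sides: the source can keep accepting events while the sink is slow, at the cost of buffering them in between.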

Flume (data acquisition, batch)

Air quality syslogs

Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis.

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";



Scribe (data acquisition, batch)

• Server for aggregating log data streamed in real time from a large number of servers
• A Scribe server runs on every node in the system, configured to aggregate messages and send them in larger groups to a central Scribe server (or servers)
• The central Scribe server(s) can write the messages to the files that are their final destination

Scribe (data acquisition, batch)

• Sending a sensor message to a Scribe server

category = 'mobile'

# '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readline()
log_entry = scribe.LogEntry(category, message)

# Create a Scribe client
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
result = client.Log(messages=[log_entry])
transport.close()

HDFS (data storage, batch)

• Distributed file system for Hadoop
• Master-slave architecture (NameNode – DataNodes)
o NameNode: manages the directory tree and regulates access to files by clients
o DataNodes: store the data
• Files are split into blocks of the same size, and these blocks are stored and replicated across a set of DataNodes
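Block splitting and replication can be sketched as follows. This is a simplified illustration, not HDFS code: the function names are hypothetical, and replicas are placed round-robin, whereas real HDFS placement is rack-aware.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Returns the kind of block -> DataNodes mapping a NameNode would keep.
    Assumes replication <= len(datanodes).
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"x" * 10, block_size=4)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
```

With a replication factor of 3, any single DataNode can fail and every block still has two live copies.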

HBase (data storage, batch)

• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable
• Random, real-time read/write access to the data
• Not a relational database
o Very light «schema»
• Rows are stored in sorted order

MapReduce (data analytics, batch)

• Framework for processing large amounts of data in parallel across a distributed cluster
• Loosely inspired by the classic Divide and Conquer (D&C) strategy
• The developer has to implement the Map and Reduce functions:
o Map: takes the input, partitions it into smaller sub-problems and distributes them to worker nodes, parsed into <K, V> pairs
o Reduce: collects the <K, List(V)> pairs and generates the results

MapReduce (data analytics, batch)

• Design patterns
o Joins: reduce-side join, replicated join, semi join
o Sorting: secondary sort, total order sort
o Filtering
o Statistics: AVG, VAR, Count, …
o Top-K
o Binning
o …
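As one example of these patterns, a reduce-side join can be simulated in plain Python. This is a sketch under assumptions, not Hadoop code: the two small datasets and the function names are made up for illustration, and the shuffle step stands in for what the framework does between map and reduce.

```python
from collections import defaultdict

# Hypothetical inputs: station metadata and SO2 readings, both keyed by station ID.
stations = [(1, "Station A"), (2, "Station B")]
readings = [(1, 7), (2, 5), (1, 6)]


def map_phase():
    """Emit <join_key, (source_tag, value)> pairs from both datasets."""
    for station_id, name in stations:
        yield station_id, ("S", name)
    for station_id, so2 in readings:
        yield station_id, ("R", so2)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every station record with every reading."""
    joined = []
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "S"]
        values = [v for tag, v in groups[key] if tag == "R"]
        for name in names:
            for value in values:
                joined.append((key, name, value))
    return joined


joined_rows = reduce_phase(shuffle(map_phase()))
```

The source tag on each value is what lets the reducer tell the two datasets apart once they arrive mixed under the same key.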

MapReduce (data analytics, batch)

• Obtain the SO2 average of each station

• Mappers read the input records and emit the SO2 value as <Station_ID, SO2_value> pairs

[Figure: the input data is split among three mappers, which emit <Station_ID, SO2_value> pairs such as <1, 2> <3, 1> <1, 9>; <3, 9> <2, 6> <2, 6> <1, 6>; and <2, 0> <2, 8> <1, 2> <3, 9>. The pairs then go through shuffling.]

MapReduce (data analytics, batch)

• Each reducer receives <Station_ID, List<SO2_value>> and computes the average for the station

[Figure: after shuffling, each reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]>, sums the values and divides by their count.]

Station_ID, AVG_SO2
1, 2.013
2, 2.695
3, 3.562
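The map, shuffle and reduce steps of the SO2 average can be simulated locally in Python. This is a minimal sketch, not Hadoop code: the three sample records are invented, and the SO2 reading is assumed to be the sixth semicolon-separated field, as in the air-quality log layout.

```python
from collections import defaultdict

# Three invented records in the semicolon-separated layout of the
# air-quality logs; the SO2 reading is assumed to be the sixth field.
records = [
    '"1";"Station A";"43.5";"-5.6";"2001-01-01";"7";"8"',
    '"1";"Station A";"43.5";"-5.6";"2001-01-02";"9";"7"',
    '"2";"Station B";"43.4";"-5.8";"2001-01-01";"4";"6"',
]


def mapper(record):
    """Parse one record and emit <station_id, so2_value>."""
    fields = [f.strip().strip('"') for f in record.split(";")]
    yield int(fields[0]), float(fields[5])


def shuffle(pairs):
    """Group the emitted values by key: <station_id, [so2_1, ..., so2_n]>."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(station_id, values):
    """Sum the values and divide by their count."""
    return station_id, sum(values) / len(values)


pairs = [pair for record in records for pair in mapper(record)]
averages = dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; here everything runs in one process to show only the data flow.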

Hive (data analytics, batch)

• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
• Abstraction layer on top of MapReduce
• SQL-like language called HiveQL
• Metastore: central repository of Hive metadata

Hive (data analytics, batch)

CREATE TABLE air_quality(Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;

SELECT Estacion, Titulo, avg(SO2)
FROM air_quality
GROUP BY Estacion, Titulo;

Pig (data analytics, batch)

• Platform for analyzing large data sets
• Pig Latin: a high-level data flow programming language for expressing data analysis programs
• Abstraction layer on top of MapReduce
• Procedural language


air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';')
AS (estacion:chararray, titulo:chararray, latitud:chararray,
longitud:chararray, fecha:chararray, so2:int,
no:chararray, co:chararray, pm10:chararray, o3:chararray,
dd:chararray, vv:chararray, tmp:chararray, hr:chararray,
prb:chararray, rs:chararray, ll:chararray, ben:chararray,
tol:chararray, mxil:chararray, pm25:chararray);

grouped = GROUP air_quality BY estacion;
avg_so2 = FOREACH grouped GENERATE group, AVG(air_quality.so2);
DUMP avg_so2;

Cascading (data analytics, batch)

• Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows
• Makes development of complex Hadoop MapReduce workflows easy, in the same way that Pig does

Cascading (data analytics, batch)

// define source and sink Taps.
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// group the tuples by station
Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );

// For every Tuple group, compute the average
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, avg );

// connect source, sink and pipe assembly into a Flow
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

Spark (data analytics, batch)

• Cluster computing system for faster data analytics
• Not a modified version of Hadoop
• Compatible with HDFS
• In-memory data storage for very fast iterative processing
• MapReduce-like engine
• API in Scala, Java and Python
• Hadoop is slow due to replication, serialization and I/O tasks
• Spark can be 10x-100x faster

Shark (data analytics, batch)

• Large-scale data warehouse system for Spark
• SQL on top of Spark
• In practice, HiveQL over Spark
• Up to 100x faster than Hive

Spark / Shark (data analytics, batch)

Pros
• Faster than the Hadoop ecosystem
• Easier to develop new applications
o (Scala, Java and Python API)

Cons
• Not tested in extremely large clusters yet
• Problems when the Reducer's data does not fit in memory



Real-time processing technologies

Data acquisition: Flume
Data storage: Kafka, Kestrel
Data analysis: Flume, Storm, Trident, S4, Spark Streaming

Flume (data acquisition, real-time)

Kafka (data storage, real-time)

• Kafka is a distributed, partitioned, replicated commit log service
o Producer/Consumer model
o Kafka maintains feeds of messages in categories called topics
o Kafka runs as a cluster
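The idea of a partitioned commit log with consumer-side offsets can be sketched in memory. This is not the Kafka API: `Topic` and `Consumer` are hypothetical classes, replication is left out, and hashing the key with CRC32 is an assumption made only to pick a partition deterministically.

```python
import zlib
from collections import defaultdict


class Topic:
    """A topic as a set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        """Append to the partition chosen by hashing the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Reads partitions sequentially and tracks its own offset per partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition):
        log = self.topic.partitions[partition]
        messages = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)
        return messages


topic = Topic(num_partitions=2)
p1, first_offset = topic.send("station-1", "1;7;8")
p2, second_offset = topic.send("station-1", "1;7;7")
consumer = Consumer(topic)
first_poll = consumer.poll(p1)
second_poll = consumer.poll(p1)
```

Because the log is append-only and the consumer only remembers an offset, messages with the same key are read back in the order they were sent, and reading them does not remove them from the log.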

Kafka (data storage, real-time)

Insert the AirQuality sensor log file into the Kafka cluster and consume the info.

// New producer
Producer<String, String> producer = new Producer<String, String>(config);

// Open sensor log file
BufferedReader br…
String line;

while (true) {
    line = br.readLine();
    if (line == null) {
        … // wait
    } else {
        producer.send(new KeyedMessage<String, String>(topic, line));
    }
}

Kafka (data storage, real-time)

AirQuality consumer

ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, 1);

Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaMessageStream stream = consumerMap.get(topic).get(0);
ConsumerIterator it = stream.iterator();

while (it.hasNext()) {
    // consume it.next()
}

Kestrel (data storage, real-time)

• Simple distributed message queue
• A single Kestrel server has a set of queues (strictly-ordered FIFO)
• In a cluster of Kestrel servers, the servers don't know about each other and don't do any cross-communication
• Kestrel vs Kafka
o Kafka consumers are cheaper (basically just the bandwidth usage)
o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation
o Kafka has significantly better throughput
o Kestrel does not support ordered consumption

Flume (data analytics, real-time)

Interceptor

• Interface org.apache.flume.interceptor.Interceptor
• Can modify or even drop events based on any criteria
• Flume supports chaining of interceptors
• Types:
o Timestamp interceptor
o Host interceptor
o Static interceptor
o UUID interceptor
o Morphline interceptor
o Regex Filtering interceptor
o Regex Extractor interceptor

Flume (data analytics, real-time)

• The sensors' information must be filtered by "Station 2"
o An interceptor will filter information between Source and Channel

Flume (data analytics, real-time)

# Write format can be text or writable
…
# Defining channel – Memory type
…
# Defining source – Syslog
…
# Defining sink – HDFS
…
# Defining interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter

class StationFilter implements Interceptor
…
// pseudocode: station is the Station field parsed from the event body
if (!"2".equals(station))
    discard data;
else
    save data;
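The filtering logic and the chaining of interceptors can be sketched in Python. This is an illustration only, not the Flume Java interface: events are modelled as dictionaries, the function names are hypothetical, the station ID is assumed to be the first semicolon-separated field, and returning `None` is used here to mean that the event is dropped.

```python
def timestamp_interceptor(event):
    """Add a fixed timestamp header (a real one would use the current time)."""
    event["headers"]["timestamp"] = 0
    return event


def station_filter(event, station="2"):
    """Keep the event only if its first field is the wanted station; else drop it."""
    first_field = event["body"].split(";")[0].strip().strip('"')
    return event if first_field == station else None


def apply_interceptors(event, interceptors):
    """Run the event through each interceptor in order; None means dropped."""
    for interceptor in interceptors:
        event = interceptor(event)
        if event is None:
            return None
    return event


events = [
    {"headers": {}, "body": '"1";"Station A";"7"'},
    {"headers": {}, "body": '"2";"Station B";"5"'},
]
chain = [timestamp_interceptor, station_filter]
kept = [e for e in (apply_interceptors(ev, chain) for ev in events) if e is not None]
```

Only the event from station 2 survives the chain, and it carries the header added by the first interceptor, which is the behaviour the slide describes between Source and Channel.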

Storm (data analytics, real-time)

Hadoop and Storm equivalents:
JobTracker → Nimbus
TaskTracker → Supervisor
Job → Topology

• Distributed and scalable real-time computation system
• Does for real-time processing what Hadoop did for batch processing
• Topology: processing graph. Each node contains processing logic (spouts and bolts). Links between nodes are streams of data
o Spout: source of streams. Reads a data source and emits the data into the topology as a stream
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object

Storm (data analytics, real-time)

• AirQuality average values

o Step 1: build the topology

[Figure: topology CAReader (spout) → LineProcessor (bolt) → AvgValues (bolt)]


Storm (data analytics, real-time)

o Step 1: build the topology

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("ca-reader", new CAReader(), 1);

// shuffleGrouping -> even distribution
builder.setBolt("ca-line-processor", new LineProcessor(), 3)
    .shuffleGrouping("ca-reader");

// fieldsGrouping -> tuples with the same field value go to the same task
builder.setBolt("ca-avg-values", new AvgValues(), 2)
    .fieldsGrouping("ca-line-processor", new Fields("id"));

Storm (data analytics, real-time)

o Step 2: CAReader implementation (IRichSpout interface)

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    // Initialize file (br is a BufferedReader field of the spout)
    br = new …
    …
}

public void nextTuple() {
    try {
        String line = br.readLine();
        if (line != null) {
            collector.emit(new Values(line));
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

Storm (data analytics, real-time)

o Step 3: LineProcessor implementation (IBasicBolt interface)

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("id", "stationName", "lat", …));
}

public void execute(Tuple input, BasicOutputCollector collector) {
    // split the line into its fields and emit them as one tuple
    collector.emit(new Values((Object[]) input.getString(0).split(";")));
}

Storm (data analytics, real-time)

o Step 4: AvgValues implementation (IBasicBolt interface)

public void execute(Tuple input, BasicOutputCollector collector) {
    // totals and counts are hash maps holding each station's accumulated values
    String id = input.getStringByField("id");
    if (totals.containsKey(id)) {
        item = totals.get(id);
        count = counts.get(id);
    } else {
        // Create new item
    }
    // update values
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
    …
}
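Why the averaging bolt needs a fields grouping can be seen in a small simulation. This is a sketch, not Storm code: `fields_grouping` and `AvgTask` are hypothetical names, and hashing the key with CRC32 is an assumption standing in for however the framework maps a field value to a task.

```python
import zlib
from collections import defaultdict


def fields_grouping(key, num_tasks):
    """Route a tuple by hashing the grouping field, so equal keys share a task."""
    return zlib.crc32(str(key).encode("utf-8")) % num_tasks


class AvgTask:
    """One bolt task: keeps running totals and counts for the stations it sees."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def execute(self, station_id, so2):
        self.totals[station_id] += so2
        self.counts[station_id] += 1
        return self.totals[station_id] / self.counts[station_id]


tasks = [AvgTask(), AvgTask()]  # two tasks, matching the parallelism of the averaging bolt
stream = [("1", 7), ("2", 4), ("1", 9), ("2", 6)]
latest_avg = {}
for station_id, so2 in stream:
    task = tasks[fields_grouping(station_id, len(tasks))]
    latest_avg[station_id] = task.execute(station_id, so2)
```

Since all tuples of a station reach the same task, each task's local totals are complete for its stations. With a shuffle grouping the values of one station would be spread over both tasks and neither could compute the right average.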

Trident (data analytics, real-time)

• High-level abstraction on top of Storm
o Provides high-level operations (joins, filters, projections, aggregations, functions…)

Pros
o Easy, powerful and flexible
o Incremental topology development
o Exactly-once semantics

Cons
o Very few built-in functions
o Lower performance and higher latency than Storm
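One common way to obtain exactly-once updates is to store the ID of the last applied batch next to the state, so that a replayed batch is recognised and skipped. The sketch below illustrates that idea in plain Python. It is not Trident code: `TransactionalSum` is a hypothetical class, and it assumes batches arrive in ID order and that a replayed batch has the same contents as the original.

```python
class TransactionalSum:
    """Keeps a running sum together with the ID of the last batch applied."""

    def __init__(self):
        self.value = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        """Apply a batch once; a replay with the same txid is ignored."""
        if txid == self.last_txid:
            return self.value
        self.value += sum(batch)
        self.last_txid = txid
        return self.value


state = TransactionalSum()
state.apply_batch(1, [7, 8])
state.apply_batch(2, [6])
state.apply_batch(2, [6])  # replay after a simulated failure
total = state.value
```

The replayed batch does not change the sum, so the result is the same as if every tuple had been processed exactly once.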

S4 (data analytics, real-time)

• Simple Scalable Streaming System
• Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data
• Inspired by the MapReduce and Actor models of computation
o Data processing is based on Processing Elements (PEs)
o Messages are transmitted between PEs in the form of events (Key, Attributes)
o Processing Nodes are the logical hosts of PEs

S4 (data analytics, real-time)

…
<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>
<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>
…

Spark Streaming (data analytics, real-time)

• Spark for real-time processing
• Streaming computation as a series of very short batch jobs (windows)
• Keeps state in memory
• API similar to Spark
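Treating a stream as a series of very short batch jobs can be sketched as follows. This is a plain-Python illustration, not the Spark Streaming API: the function names and the (timestamp, station, so2) event layout are assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, station, so2) events into consecutive time intervals."""
    batches = defaultdict(list)
    for timestamp, station, so2 in events:
        batches[timestamp // batch_interval].append((station, so2))
    return [batches[i] for i in sorted(batches)]


def process_stream(events, batch_interval):
    """Run one small batch job per interval, carrying state between jobs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    snapshots = []
    for batch in micro_batches(events, batch_interval):
        for station, so2 in batch:
            totals[station] += so2
            counts[station] += 1
        snapshots.append({s: totals[s] / counts[s] for s in totals})
    return snapshots


events = [(0, 1, 7.0), (1, 1, 9.0), (2, 2, 4.0), (3, 1, 5.0)]
snapshots = process_stream(events, batch_interval=2)
```

Each snapshot is the result available after one interval, so the latency of the results is bounded below by the batch interval chosen.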



Hybrid Computation Model

• We are at the beginning of this generation
• Short-term Big Data processing goal
• Abstraction layer over the Lambda Architecture
• Promising technologies
o SummingBird
o Lambdoop

SummingBird (hybrid computation model)

• Library for writing MapReduce-like processes that can be executed on Hadoop, on Storm or in a hybrid model
• Scala syntax
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode

SummingBird (hybrid computation model)

Pros
• Hybrid computation model
• Same programming model for all processing paradigms
• Extensible

Cons
• MapReduce-like programming
• Scala
• Not as abstract as some users would like

Lambdoop (hybrid computation model)

• Software abstraction layer over open-source technologies
o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident
• Common patterns and operations (aggregation, filtering, statistics…) already implemented. No MapReduce-like processes
• A single API for the three processing paradigms
o Batch processing similar to Pig / Cascading
o Real-time processing using built-in functions, easier than Trident
o Hybrid computation model transparent to the developer

Lambdoop (hybrid computation model)

[Figure: Lambdoop model. Static data and streaming data are represented as Data; a Workflow applies Operations to Data and produces Data.]

Lambdoop (hybrid computation model)

DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);
batch.run();

Data results = batch.getResults();
…

Lambdoop (hybrid computation model)

DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);
streaming.run();

while (true) {
    Data live_results = streaming.getResults();
    …
}

Lambdoop (hybrid computation model)

DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data all_info = new Data(historical, stream_sensor);
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);
hybrid.run();

Data updated_results = hybrid.getResults();

Lambdoop (hybrid computation model)

Pros
• High abstraction layer for all processing models
• All steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible

Cons
• Ongoing project
• Not open-source yet
• Not tested in large clusters yet



Conclusions

• Big Data is not only Hadoop
• Identify the processing requirements of your project
• Analyze the alternatives for all steps in the data pipeline
• The battle for real-time processing is open
• Stay tuned for the hybrid computation model

Thanks for your attention!

www.datadopter.com

www.treelogic.com

Contact us:

[email protected]

[email protected]

MADRID Avda. de Manoteras, 38

Oficina D507

28050 Madrid · España

ASTURIAS Parque Tecnológico de Asturias

Parcela 30

33428 Llanera - Asturias · España

902 286 386
