a real-time architecture using Hadoop and Storm at Devoxx

@nathan_gs#DV13-#rtbigdata

A real-time architecture using Hadoop and Storm.

Nathan Bijnens

Speaker

Nathan Bijnens, DataCrunchers, @nathan_gs

Our Vision

Big Data

•Volume

•Velocity

•Variety

Computing Trends

Source: Immutability Changes Everything - Pat Helland, RICON2012

Past

•Computation (CPUs) Expensive

•Disk Storage Expensive

•Coordination Easy (Latches Don’t Often Hit)

•DRAM Expensive

Current

•Computation Cheap (Many Core Computers)

•Disk Storage Cheap (Cheap Commodity Disks)

•Coordination Hard (Latches Stall a Lot, etc.)

•DRAM / SSD Getting Cheap

Credits

Nathan Marz

•Ex-Backtype & Twitter

•Startup in stealth mode

•Storm

•Cascalog

•ElephantDB

manning.com/marz

A Data System

Not all information is equal.

Some information is derived from other pieces of information.

Data is more than Information

Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Data is more than Information

Events used to manipulate the master data.

Events - Before

Today, events are the master data.

Events - After

Let’s store everything.

Data System

Data is Immutable

Events

Data is Time Based

Events

Capturing change traditionally

Before:

Person  Location
Nathan  Antwerp
Geert   Dendermonde
John    Ghent

After:

Person  Location
Nathan  Ghent
Geert   Dendermonde
John    Ghent

Capturing change

Person  Location     Time
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02
Nathan  Ghent        2013-02-03

Person  Location     Timestamp
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02

The data you query is often transformed, aggregated, ...

Rarely used in its original form.

Query

Query

Query = function ( all data )

Number of people living in each city.

Person  Location     Time
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02
Nathan  Ghent        2013-02-03

Location     Count
Ghent        2
Dendermonde  1
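The count above can be read as a function over the full event log. A minimal Python sketch, assuming each event is a (person, location, date) tuple and that a person lives at their most recent location; the name `people_per_city` is hypothetical and not from the talk:

```python
from collections import Counter

# Append-only event log, taken from the table above: (person, location, date).
EVENTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(events):
    """Query = function(all data): count people by their latest known location."""
    latest = {}  # person -> (date, location)
    for person, location, date in events:
        # ISO-formatted dates compare correctly as plain strings.
        if person not in latest or date > latest[person][0]:
            latest[person] = (date, location)
    return Counter(location for _, location in latest.values())

print(dict(people_per_city(EVENTS)))  # {'Ghent': 2, 'Dendermonde': 1}
```

No event is ever updated or deleted here; Nathan's move is a new row, and the query decides which row wins.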

Query

All Data → Query

Query: Precompute

All Data → Precomputed View → Query

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

Layered Architecture

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

Batch Layer

Batch Layer

Hadoop

ElephantDB

Incoming Data

Unrestrained computation.

Batch Layer

No need to De-Normalize.

Batch Layer

Horizontally scalable.

Batch Layer

High Latency. Let’s pretend temporarily that update latency doesn’t matter.

Batch Layer

Functional computation, based on immutable inputs, is idempotent.

Batch Layer

Stores master copy of data set... append only.

Batch Layer

Batch Layer

Batch: View generation

Master Dataset → MapReduce → View #1

Master Dataset → MapReduce → View #2

Master Dataset → MapReduce → View #3

1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

DoWork() DoWork() DoWork() …

Output

MapReduce
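The three steps can be sketched in plain Python. This is not Hadoop code; it is a single-process illustration, and `split`, `do_work`, and `combine` are hypothetical names chosen for the example:

```python
from collections import Counter
from functools import reduce

def split(records, n_subsets):
    """Step 1: divide a large data set into subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]

def do_work(subset):
    """Step 2: the same function runs on every subset (here: count locations)."""
    return Counter(subset)

def combine(left, right):
    """Step 3: merge the partial outputs of two subsets."""
    return left + right

locations = ["Ghent", "Dendermonde", "Ghent", "Antwerp", "Ghent"]
partials = [do_work(s) for s in split(locations, 3)]  # could run in parallel
output = reduce(combine, partials, Counter())
print(dict(output))
```

Because `do_work` only sees its own subset and `combine` is associative, the subsets could in principle be processed on different machines and merged in any order.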

MapReduce

Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.

Serialization & Schema

CSV is actually a serialization language that is just poorly defined.

Serialization & Schema

•Use a format with a schema.

•Thrift

•Avro

•Protocol Buffers

•Added bonus: it’s faster & uses less space.
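The idea of validating on write can be illustrated without any of the libraries above. The sketch below uses only the Python standard library; `LocationEvent` and `append_event` are hypothetical names, and a real system would more likely rely on a Thrift, Avro, or Protocol Buffers schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: an event is immutable once created
class LocationEvent:
    person: str
    location: str
    time: date

    def __post_init__(self):
        # Reject malformed records before they reach the master dataset.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.time, date):
            raise ValueError("time must be a datetime.date")

def append_event(log, person, location, time):
    """Validation on write: a bad record fails here, not in a later query."""
    log.append(LocationEvent(person, location, time))

log = []
append_event(log, "Nathan", "Ghent", date(2013, 2, 3))
try:
    append_event(log, "Geert", "Dendermonde", "2011-10-08")  # wrong type
except ValueError as err:
    print("rejected:", err)
```

The malformed record never enters the log, so every later batch run can trust what it reads.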

Read-only database. No random writes required.

Batch View Database

Every iteration produces the Views from scratch.

Batch View Database

Batch View Database

•ElephantDB

•Splout

•Voldemort

•…

Batch Layer

Data absorbed into Batch Views

Not yet absorbed.

Time

Now

We are not done yet… Just a few hours of data.

Speed Layer

Overview

Hadoop

ElephantDB

Incoming Data

Cassandra

Stream processing.

Speed Layer

Continuous computation.

Speed Layer

Transactional.

Speed Layer

Storing a limited window of data.

Compensating for the last few hours of data.

Speed Layer

All the complexity is isolated in the Speed layer.

If anything goes wrong, it’s auto-corrected.

Speed Layer

CAP

You have a choice between:

•Availability

•Queries are eventually consistent.

•Consistency

•Queries are consistent.

Consistency

Availability

Partition Tolerance

Some algorithms are hard to implement in real time. For those cases we could estimate the results.

Eventual accuracy

Speed Layer

Incoming Data

Real Time View 1

Real Time View 2

Storm

•Message passing.

•Distributed processing.

•Horizontally scalable.

•Incremental algorithms.

•Fast.

•Data in motion.
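An incremental algorithm updates a view per incoming event instead of recomputing it from all data. The sketch below is plain Python rather than Storm code, `RealTimeCityCounts` is a hypothetical name, and it assumes events arrive in time order; it keeps a people-per-city count current as location events come in:

```python
from collections import Counter

class RealTimeCityCounts:
    """Keeps a people-per-city view current, one event at a time."""

    def __init__(self):
        self.current = {}        # person -> last known location
        self.counts = Counter()  # location -> number of people

    def on_event(self, person, location):
        previous = self.current.get(person)
        if previous == location:
            return  # nothing changed
        if previous is not None:
            self.counts[previous] -= 1  # the person left this city
            if self.counts[previous] == 0:
                del self.counts[previous]
        self.counts[location] += 1
        self.current[person] = location

view = RealTimeCityCounts()
for person, location in [("Nathan", "Antwerp"), ("Geert", "Dendermonde"),
                         ("John", "Ghent"), ("Nathan", "Ghent")]:
    view.on_event(person, location)
print(dict(view.counts))  # {'Dendermonde': 1, 'Ghent': 2}
```

Unlike the batch computation, this view depends on mutable state and on event order, which is part of why the speed layer is the more complex one.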

Storm

Storm

•Tuple

•Stream

Storm

•Spout

•Bolt

Storm

•Grouping

Data Ingestion

•Kafka

•Flume

•Scribe

•*MQ

•…

Speed Layer Views

•The views are stored in a read & write database.

•Cassandra

•HBase

•Redis

•MySQL

•ElasticSearch

•…

•Much more complex than a read only view.

Serving Layer

Overview

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

Random reads

Serving Layer

This layer queries the Batch & Real Time views and merges them.

Serving Layer

Serving Layer

Real Time Views

Merge

Batch Views
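For additive views such as counts, the merge can be as simple as summing both views per key. A minimal sketch, assuming the batch view covers all data up to the last batch run and the real-time view covers only what arrived since; the function name and the sample numbers are made up for illustration:

```python
def merge_counts(batch_view, realtime_view):
    """Combine a batch view and a real-time view by adding counts per key."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"Ghent": 120, "Antwerp": 300}      # precomputed by the batch layer
realtime_view = {"Ghent": 2, "Dendermonde": 1}   # last few hours only
print(merge_counts(batch_view, realtime_view))
# {'Ghent': 122, 'Antwerp': 300, 'Dendermonde': 1}
```

This only works when the two views cover disjoint time ranges; otherwise events would be counted twice.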

How to query an Average?

Serving Layer
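One common answer, which the slide itself does not spell out, is to store a sum and a count in each view rather than the average, because two averages cannot be merged directly. A sketch with made-up numbers and hypothetical names:

```python
def merge_average(batch, realtime):
    """Each view holds (total, count); the average is derived at query time."""
    total = batch[0] + realtime[0]
    count = batch[1] + realtime[1]
    return total / count if count else None

batch = (900.0, 90)      # sum and count from the batch view (average 10.0)
realtime = (200.0, 10)   # sum and count from the real-time view (average 20.0)
print(merge_average(batch, realtime))  # 11.0
```

Averaging the two per-view averages would give 15.0, which overweights the small real-time window.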

Overview

Overview

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

CQRS

Source: martinfowler.com/bliki/CQRS.html – Martin Fowler

Lambda Architecture

Lambda Architecture

Can discard any view, batch and real time, and just recreate everything from the master data.

Lambda Architecture

Mistakes are corrected via recomputation.

Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.

Lambda Architecture

Data storage is highly optimized.

Lambda Architecture

Immutability changes everything.

Questions?

@nathan_gs & #DV13

slideshare.net/nathan_gs

nathan@datacrunchers.eu

DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Thank you

@nathan_gs

nathan@datacrunchers.eu