75
A real-time architecture using Hadoop and Storm.

A real-time architecture using Hadoop and Storm @ JAX London

Embed Size (px)

DESCRIPTION

 

Citation preview

Page 1: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop and Storm.

Page 2: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 2

Speaker

Nathan Bijnens@nathan_gs

Page 3: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 3

Our Vision

Big Data

test

Volume

Page 4: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 4

Big Data

test

Velocity

Page 5: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 5

Our Vision

Volume

test

Variety

Page 6: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 6

Computing Trends

Source: Immutability Changes Everything - Pat Helland, RICON2012

Computation (CPUs) Expensive

Disk Storage Expensive

Coordination Easy(Latches Don t Often Hit)

DRAM Expensive

Computation Cheap (Many Core Computers)

Disk Storage Cheap(Cheap Commodity Disks)

Coordination Hard(Latches Stall a Lot, etc)

DRAM / SSD Getting Cheap

Past Current

Page 7: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 7

Credits

Nathan MarzEx-Backtype & Twitter

Startup in Stealthmode

Storm

Cascalog

ElephantDB

manning.com/marz

Page 8: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 8

A Data System

Page 9: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 9

Not all information is equal. Some information is derived from other pieces of

information.

Data is more than Information

Page 10: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 10

Eventually you will reach the most

This is the information you hold true, simple because it exists.

Data is more than Information

Page 11: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 11

Events used to manipulate the master data.

Events - Before

Page 12: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 12

Today, events are the master data.

Events - After

Page 13: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 13

everything.

Data System

Page 14: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 14

Data is Immutable

Events

Page 15: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 15

Data is Time Based

Events

Page 16: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 16

Capturing change traditionally

Person Location

Nathan Antwerp

Geert Dendermonde

John Ghent

Person Location

Nathan Ghent

Geert Dendermonde

John Ghent

Page 17: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 17

Capturing change

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Person Location Timestamp

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Page 18: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 18

The data you query is often transformed, aggregated, ...

Query

Page 19: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 19

Query

Query = function ( all data )

Page 20: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 20

Number of people living in each city.

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Location Count

Ghent 2

Dendermonde 1

Page 21: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 22

Query

All Data Query

Page 22: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 23

Query: Precompute

All Data QueryPrecomputed

View

Page 23: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 24

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

Page 24: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 25

Layered Architecture

HadoopElephant

DB

Qu

ery

Incoming Data

Cassandra

Page 25: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 26

Batch Layer

Page 26: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 27

Batch Layer

HadoopElephant

DB

Incoming Data

Page 27: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 28

Unrestrained computation.

Batch Layer

Page 28: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 29

No need to De-Normalize.

Batch Layer

Page 29: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 30

Horizontal scalable.

Batch Layer

Page 30: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 31

High Latency.

matter.

Batch Layer

Page 31: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 32

Functional computation, based on immutable inputs, is idempotent.

Batch Layer

Page 32: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 33

Stores master copy of data set...

Batch Layer

append only.

Page 33: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 34

Batch Layer

Page 34: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 35

Batch: View generation

Master Dataset

View #1

View #3

View #2MapReduce

Page 35: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 36

1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

Output

MA

PR

ED

UC

E

MapReduce

DoWork() DoWork() DoWork()…

Page 36: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 37

Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.

Page 37: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 38

Serialization & Schema

CSV is actually a serialization language that is just poorly defined.

Page 38: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 39

Serialization & Schema

Use a format with a schema.- Thrift

- Avro

- Protobuffers

Page 39: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 40

Read only database.No random writes required.

Batch View Database

Page 40: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 41

Every iteration produces the Views from scratch.

Batch View Database

Page 41: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 42

Batch View Database

ElephantDB

Splout

Voldemort

Page 42: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 44

Batch Layer

Not yet absorbed.Data absorbed into Batch Views

Time No

w

Just a few hours of data.

Page 43: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 45

Speed Layer

Page 44: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 46

Overview

HadoopElephant

DB

Incoming Data

Cassandra

Page 45: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 47

Stream processing.

Speed Layer

Page 46: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 48

Continuous computation.

Speed Layer

Page 47: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 49

Transactional.

Speed Layer

Page 48: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 50

Storing a limited window of data.Compensating for the last few hours of data.

Speed Layer

Page 49: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 51

All the complexity is isolated in the Speed layer.

-corrected.

Speed Layer

Page 50: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 52

CAP

You have a choice between:

Availability- Queries are eventual consistent.

Consistency- Queries are consistent.

Page 51: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 53

Some algorithms are hard to implement in real time. For those cases we could

estimate the results.

Eventual accuracy

Page 52: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 54

Speed Layer

Incoming Data

Real Time

View 1

Real Time

View 2

Page 53: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 55

Storm

Message passing.

Distributed processing.

Horizontally scalable.

Incremental algorithms.

Fast.

Data in motion.

Page 54: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 56

Storm

Nimbus Zookeeper

Worker Node

Supervisor

Executer

Executer

Executer

Worker Node

Supervisor

Executer

Executer

Executer

Worker Node

Supervisor

Executer

Executer

Executer

Page 55: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 57

Storm

Tuple

Stream

Page 56: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 58

Storm

Spout

Bolt

Page 57: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 59

Storm

Grouping

Page 58: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 60

Data Ingestion

Kafka

Flume

Scribe

*MQ

Kestrel

Page 59: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 61

Speed Layer Views

The views are stored in Read & Write database.- Cassandra

- Hbase

- Redis

- MySQL

- ElasticSearch

-

Much more complex than a read only view.

Page 60: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 62

Serving Layer

Page 61: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 63

Overview

HadoopElephant

DB

Qu

ery

Incoming Data

Cassandra

Page 62: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 64

Random reads

Serving Layer

Page 63: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 65

This layer queries the Batch & Real Time views and merges it.

Serving Layer

Page 64: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 66

Serving Layer

Real Time Views

Merge

Batch Views

Page 65: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 67

How to query an Average?

Serving Layer

Page 66: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 68

Overview

Page 67: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 69

Overview

HadoopElephant

DB

Qu

ery

Incoming Data

Cassandra

Page 68: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 70

Lambda Architecture

Page 69: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 71

Lambda Architecture

Can discard any view, batch and real time, and just recreate everything from the master

data.

Page 70: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 72

Lambda Architecture

Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.

Page 71: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 73

Lambda Architecture

Data storage is highly optimized.

Page 72: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 74

Lambda Architecture

Immutability changes everything.

Page 73: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 75

Questions?

Questions?@nathan_gs & #BigDataCon13

Page 74: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 76

DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Page 75: A real-time architecture using Hadoop and Storm @ JAX London

A real-time architecture using Hadoop & Storm. #JaxLondon 77

Thank you

Thank you@nathan_gs