a real-time architecture using Hadoop and Storm at Devoxx

@nathan_gs#DV13-#rtbigdata

A real-time architecture using Hadoop and Storm.

Nathan Bijnens

Speaker

Nathan BijnensDataCrunchers@nathan_gs

Our Vision

Big Data

Volume

Big Data

Velocity

Our Vision

Volume

Variety

Computing Trends

Source: Immutability Changes Everything - Pat Helland, RICON2012

Computation (CPUs) Expensive

Disk Storage Expensive

Coordination Easy(Latches Don’t Often Hit)

DRAM Expensive

Computation Cheap (Many Core Computers)

Disk Storage Cheap(Cheap Commodity Disks)

Coordination Hard(Latches Stall a Lot, etc)

DRAM / SSD Getting Cheap

Past Current

Credits

Nathan Marz•Ex-Backtype & Twitter

•Startup in Stealthmode

•Storm

•Cascalog

•ElephantDB

manning.com/marz

A Data System

Not all information is equal.

Some information is derived from other pieces of information.

Data is more than Information

Eventually you will reach the most ‘raw’ form of

information.This is the information you hold true,

simple because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Data is more than Information

Events used to manipulate the master

Events - Before

Today, events are the master data.

Events - After

Let’s store everything.

Data System

Data is Immutable

Events

Data is Time Based

Events

Capturing change traditionally

Person Location

Nathan Antwerp

Geert Dendermonde

John Ghent

Person Location

Nathan Ghent

Geert Dendermonde

John Ghent

Capturing change

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Person Location Timestamp

Geert Dendermonde

2011-10-08

The data you query is often transformed, aggregated, ...

Rarely used in it’s original form.

Query = function ( all data )

Number of people living in each city.

Person Location Time

Geert Dendermonde

2011-10-08

Nathan Ghent 2013-02-03

Location Count

Ghent 2

Dendermonde 1

All Data Query

Query: Precompute

All Data Query

Precomputed View

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

Layered Architecture

HadoopElephan

Incoming Data

Cassandra

Batch Layer

HadoopElephan

Incoming Data

Unrestrained computation.

Batch Layer

No need to De-Normalize.

Batch Layer

Horizontal scalable.

Batch Layer

High Latency.Let’s pretend temporarily that update latency

doesn’t matter.

Batch Layer

Functional computation, based on immutable inputs, is idempotent.

Batch Layer

Stores master copy of data set...

Batch Layer

append only.

Batch Layer

Batch: View generation

Master Dataset

View #1

View #3

View #2

MapReduc

MapReduce

1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

Output

MapReduce

DoWork() DoWork() DoWork()…

MapReduce

Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.

CSV is actually a serialization language that is just poorly defined.

•Use a format with a schema.

•Thrift

•Avro

•Protobuffers

•Added bonus: it’s faster & uses less space.

Read only database.No random writes required.

Batch View Database

Every iteration produces the Views from scratch.

Batch View Database

•ElephantDB

•Splout

•Voldemort

•…

Batch Layer

Not yet absorbe

d.Data absorbed into Batch Views

Time Now

We are not done yet…Just a few hours of data.

Speed Layer

Overview

HadoopElephan

Incoming Data

Cassandra

Stream processing.

Speed Layer

Continuous computation.

Speed Layer

Transactional.

Speed Layer

Storing a limited window of data.

Compensating for the last few hours of data.

Speed Layer

All the complexity is isolated in the Speed layer.

If anything goes wrong, it’s auto-corrected.

Speed Layer

You have a choice between:

•Availability

•Queries are eventual consistent.

•Consistency

•Queries are consistent.

Consistency

Availability

Partition Toleranc

Some algorithms are hard to implement in real time. For those

cases we could estimate the results.

Eventual accuracy

Speed Layer

Incoming Data

Real Time

Real Time

•Message passing.

•Distributed processing.

•Horizontally scalable.

•Incremental algorithms.

•Fast.

•Data in motion.

•Tuple

•Stream

•Spout

•Bolt

•Grouping

Data Ingestion

•Kafka

•Flume

•Scribe

•*MQ

•…

Speed Layer Views

•The views are stored in Read & Write database.

•Cassandra

•Hbase

•Redis

•MySQL

•ElasticSearch

•…

•Much more complex than a read only view.

Serving Layer

Overview

HadoopElephan

Incoming Data

Cassandra

Random reads

Serving Layer

This layer queries the Batch & Real Time views and merges it.

Serving Layer

Real Time Views

Batch Views

How to query an Average?

Serving Layer

Overview

HadoopElephan

Incoming Data

Cassandra

Source: martinfowler.com/bliki/CQRS.html – Martin Fowler

Lambda Architecture

Can discard any view, batch and real time, and just recreate everything

from the master data.

Lambda Architecture

Mistakes are corrected via recomputation.

Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.

Lambda Architecture

Data storage is highly optimized.

Lambda Architecture

Immutability changes everything.

Questions?

Questions?@nathan_gs & #DV13

slideshare.net/nathan_gs

nathan@datacrunchers.eu

DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Thank you

Thank you@nathan_gs

nathan@datacrunchers.eu

a real-time architecture using Hadoop and Storm at Devoxx

Technology

Combining Real-time and Batch Analytics with NoSQL, Storm and Hadoop - NoSQL Matters, Cologne, 2014

Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

Devoxx 2014 Monitoring

Devoxx 2014 presentation

IzPack at Devoxx 2010

JSR363 - Devoxx US

Devoxx 2012 (v2)

Vaadin roadmap-devoxx-2013

Scaling Apache Storm (Hadoop Summit 2015)

Scaling Apache Storm - Hadoop Summit 2014

Cassandra devoxx 2010

Devoxx UK Quickie 2015

Big Data Brief - National Oceanic and Atmospheric ... · Hadoop, MongoDB, Spark, Storm Hadoop, a free and open-source implementation of the Map/Reduce programming model, • processes

Devoxx Activiti in Action

Hadoop and Storm - AJUG talk

Jcp devoxx-2012

Show me the Money! Cost & Resource Tracking for Hadoop and Storm

Storm - Philly Emerging Tech · 2019-12-04 · Distributed and fault-tolerant realtime computation Storm. Basic info ... Storm Cluster Master node (similar to Hadoop JobTracker) Storm

Devoxx UK BOF session

A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne '14)