thoughtworks
What is Big Data?
A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.
Volume • Terabytes • Records • Tables • Files
Variety • Structured • Unstructured • Semi-Structured • All the above
Velocity • Batch • Realtime • Streams • Near Realtime
Brewer's CAP Theorem
A distributed data store can guarantee at most two of: Consistency, Availability, Partition Tolerance.
● CA: RDBMS
● CP: MongoDB, HBase, Redis
● AP: CouchDB, Cassandra, DynamoDB, Riak
NoSQL
● no ACID transactions
● sharded indexes
● restricted joins
● support for columnar storage
In-memory DB
● real-time transactions
● not fully geared for enterprise-level data
● variety of indexes
● complex joins
HDFS GDF HBASE
Database evolution
ACHIEVING VELOCITY (Parallel computing)
Shared Memory: processors operate on a single copy of the data, serialised by a lock.
Processor 1:
  Lock(Shared Data)
  Work(Shared Data)
  Unlock(Shared Data)

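The shared-memory model above can be sketched in Python with threads: the Lock / Work / Unlock cycle becomes a `with lock:` block around each update. The counter and thread counts are illustrative, not from the original slides.

```python
import threading

# Hypothetical shared data, protected by a single lock.
shared_data = {"total": 0}
lock = threading.Lock()

def worker(n):
    for _ in range(n):
        with lock:                      # Lock(Shared Data)
            shared_data["total"] += 1   # Work(Shared Data)
        # Unlock(Shared Data) happens when the 'with' block exits

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_data["total"])  # 40000 -- correct only because of the lock
```

Note that the lock is also the bottleneck: only one processor can touch the shared data at a time, which is what motivates the message-passing model on the next slide.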
ACHIEVING VELOCITY (Parallel computing)
Message Passing (e.g. MapReduce): the shared data is split into chunks, and each processor works on its own chunk independently.
Chunk 1 → Processor 1, Chunk 2 → Processor 2, Chunk 3 → Processor 3
Conventional Architectures!
● Pull-based batch loads
● Enterprise data models
● Complex ETL logic
● Poorly suited to non-relational data
● Emergent design is difficult
OLAP (Online Analytical Processing)
SELECT SUM(s.dollar_cost), s.product_key, p.description FROM SALES_FACT s
… … …
GROUP BY s.product_key, p.description
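A runnable sketch of a query in this shape, using a tiny in-memory SQLite star schema. The `PRODUCT_DIM` table, the join condition, and the sample rows are hypothetical; they only illustrate the SUM-and-GROUP-BY pattern above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE PRODUCT_DIM (product_key INTEGER, description TEXT);
    CREATE TABLE SALES_FACT  (product_key INTEGER, dollar_cost REAL);
    INSERT INTO PRODUCT_DIM VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO SALES_FACT  VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# OLAP-style aggregation: total cost per product.
rows = con.execute("""
    SELECT SUM(s.dollar_cost), s.product_key, p.description
    FROM SALES_FACT s
    JOIN PRODUCT_DIM p ON p.product_key = s.product_key
    GROUP BY s.product_key, p.description
""").fetchall()
print(rows)
```

The fact table holds the measures (dollar_cost) and the dimension table holds the descriptions; the analytical query joins and aggregates rather than updating rows, which is why ACID transaction machinery adds little here.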
Why Databases!
● Transaction processing (ACID properties)
● SQL: indexes and queries
OLAP
● Transaction processing not needed for analytics
  o Moving of data via ETL
● With large volumes of data, indexes become irrelevant
● Schema on Write vs Schema on Read
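The Schema on Write vs Schema on Read distinction can be sketched as follows, using hypothetical JSON event records: schema on write shapes and validates data at load time (the RDBMS style), while schema on read stores raw data and applies structure only when a query runs (the data-lake style).

```python
import json

# Hypothetical raw events; the second is missing a field.
raw_events = ['{"user": "a", "amount": 10}', '{"user": "b"}']

def load_with_schema(lines):
    """Schema on Write: normalise every record as it is loaded."""
    table = []
    for line in lines:
        rec = json.loads(line)
        table.append({"user": rec["user"], "amount": rec.get("amount", 0)})
    return table

def query_total(lines):
    """Schema on Read: keep raw strings, interpret them per query."""
    return sum(json.loads(line).get("amount", 0) for line in lines)

table = load_with_schema(raw_events)
print(table[1]["amount"])       # 0  -- defaulted at write time
print(query_total(raw_events))  # 10 -- structure applied at read time
```

Schema on write pays the modelling cost up front; schema on read defers it, which is why it suits the varied, high-volume data these slides describe.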
1. Variety - How do we deal with different kinds of data?
2. Volume - How do we cope with large volumes of data?
3. Velocity - How do we solve realtime problems?
4. Value - What is our value?
Summary!