
Page 1: The Big Bad Data

The Big Bad Data

or how to deal with data sources

Przemysław Pastuszka, Kraków, 17.10.2014

Page 2: The Big Bad Data

What do I want to talk about?

● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)

Page 3: The Big Bad Data

Ocado intro

Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.

Page 4: The Big Bad Data

Customer Fulfilment Center

Page 5: The Big Bad Data

Shop

Page 6: The Big Bad Data

How did we start?

[Architecture diagram: OCADO SERVICES (Oracle, Greenplum, JMS) export raw data into Google Cloud Storage; an overnight transform produces ORC files; a compute cluster and a user cluster, overseen by a Cluster Manager, sit alongside Google BigQuery]
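The overnight transform box above is where raw exports became ORC files. As a rough illustration only (the SparkSession API shown here postdates the 2014 talk, and the bucket paths are invented), a PySpark sketch of such a conversion could look like this:

```python
# Hypothetical sketch of the overnight raw -> ORC transform (paths are made up).
# Uses the modern SparkSession API, which postdates the setup described here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-orc").getOrCreate()

# Read one night's raw JSON export from Google Cloud Storage (gs:// connector assumed).
raw = spark.read.json("gs://example-bucket/raw/orders/2014-10-17/*.json.gz")

# Write the same data back as ORC for downstream querying.
raw.write.mode("overwrite").orc("gs://example-bucket/orc/orders/2014-10-17/")
```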

Page 7: The Big Bad Data

Looks good. So what’s the problem?

● Various data formats
  ○ json, csv, uncompressed, gzip, blob, nested…
  ○ incremental data, “snapshot” data, deltas
  ○ lots of code to handle all corner cases
● Corrupted data (a defensive-read sketch follows after this list)
  ○ corrupted gzips, empty files, invalid content, unexpected schema changes…
  ○ failures of overnight jobs
● Data exports delayed
  ○ DB downtime, network issues, human error, …
  ○ data available in the BD platform even later
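To make the corrupted-data bullet concrete, here is a minimal sketch of the defensive reading such inputs force on an ingestion job. The function, file layout, and skip-and-log policy are illustrative assumptions, not the team's actual code:

```python
# Hypothetical sketch of defensively reading a nightly export: corrupted gzips,
# empty files, and invalid content are skipped and logged instead of failing the job.
import gzip
import json

def read_events(path):
    """Yield parsed JSON events, skipping files that are empty or corrupted."""
    try:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # blank line, not an event
                try:
                    yield json.loads(line)
                except ValueError:
                    # invalid content: log and move on rather than aborting the night
                    print(f"skipping malformed record in {path}")
    except (OSError, EOFError):
        # corrupted gzip or truncated file: the whole export needs re-requesting
        print(f"skipping unreadable file {path}")
```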

Page 8: The Big Bad Data

Missing data

Page 9: The Big Bad Data

That’s not all...

● Real-time analytics?
  ○ data comes in batches every night
  ○ so forget it
● ORC is not a dream solution
  ○ not a very friendly format
  ○ overnight transform is another point of failure
  ○ data duplication (raw + ORC)
● People think you “own” the data
  ○ “please, tell me what this data means?”

Page 10: The Big Bad Data

People get frustrated

● Big Data team is frustrated
  ○ we spend lots of time on monitoring and fixing bugs
  ○ code becomes complex to handle corner cases
  ○ confidence in platform stability rapidly goes down
● Analysts are frustrated
  ○ long latency before data is available for querying
  ○ data is unreliable

Page 11: The Big Bad Data

It can’t go on like this anymore

Let’s go back to the drawing board!

Page 12: The Big Bad Data

What do we need?

● Unified input data format
  ○ JSON to the rescue
● Data goes directly from applications to the BD platform
  ○ let’s make all external services write to a distributed queue
● Data validation and monitoring (a validation sketch follows after this list)
  ○ all data coming into the system should be well-described
  ○ validation should happen as early as possible
  ○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
  ○ let’s push data to BigQuery first
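A minimal sketch of validating "as early as possible", assuming the Event Registry describes events with JSON Schema. The slides only say events must be well-described, so the schema format, the ORDER_PLACED_SCHEMA descriptor, and the validate_event helper are all assumptions:

```python
# Hypothetical early-validation step against a schema from the Event Registry.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Stand-in for an event type descriptor stored in the Event Registry.
ORDER_PLACED_SCHEMA = {
    "type": "object",
    "properties": {
        "orderId": {"type": "string"},
        "items": {"type": "integer", "minimum": 1},
    },
    "required": ["orderId", "items"],
}

def validate_event(raw_bytes):
    """Return (event, None) if the event is valid, (None, reason) otherwise."""
    try:
        event = json.loads(raw_bytes)
    except ValueError as e:
        return None, f"not valid JSON: {e}"
    try:
        validate(instance=event, schema=ORDER_PLACED_SCHEMA)
    except ValidationError as e:
        return None, f"schema violation: {e.message}"
    return event, None
```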

Page 13: The Big Bad Data

New architecture overview

[Architecture diagram: an INPUT STREAM feeds several Event Processors, which consult the Event Registry and write to Data Storage; a Cluster Manager runs clusters in the Compute Cloud; ENDPOINTS expose the data to consumers]

Page 14: The Big Bad Data

Loading data

[Data-flow diagram: the INPUT STREAM (Kinesis) delivers a raw events stream to an Event Processor, which looks up the event type descriptor (schema and processing instructions) in the Event Registry; the validated events stream is pushed to BQ, the raw stream is archived in Google Cloud Storage, and invalid events go to a store whose location is recorded in BQ, allowing ad hoc event replay; ad-hoc / scheduled exports move data between BQ and GCS; consumers use the BQ Tableau Connector, BQ Excel Connector, BQ REST API, and the GS REST API / gsutil]
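Building on the hypothetical validate_event helper from the earlier sketch, this is roughly the routing the Event Processor performs in the diagram above; the stream and sink interfaces are assumptions, not the talk's actual code:

```python
# Hypothetical Event Processor loop: keep a raw copy of everything, push valid
# events to BigQuery, and quarantine invalid ones for ad hoc replay.
# `stream`, `bq_sink`, `gcs_raw_sink`, and `invalid_store` are assumed interfaces.
def process(stream, bq_sink, gcs_raw_sink, invalid_store):
    for raw_bytes in stream:           # e.g. records polled from Kinesis
        gcs_raw_sink.write(raw_bytes)  # raw events stream -> Google Cloud Storage
        event, reason = validate_event(raw_bytes)  # from the earlier sketch
        if event is not None:
            bq_sink.write(event)       # validated events stream -> BQ
        else:
            # invalid events are kept, together with the reason, so they can
            # be replayed once the producer or the schema is fixed
            invalid_store.write(raw_bytes, reason)
```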

Page 15: The Big Bad Data

Batch processing

[Data-flow diagram: BQ queries export data to Google Cloud Storage; Compute Clusters A, B and C in the COMPUTE CLOUD read it via the GS REST API / gsutil; results are accessed through the BQ Tableau Connector, BQ Excel Connector, and BQ REST API]
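As an illustration of the BQ-to-GCS export edge in the diagram, here is a minimal sketch using today's google-cloud-bigquery client, which postdates the talk; the project, dataset, table, and bucket names are placeholders:

```python
# Hypothetical sketch of the "BQ query -> export -> Google Cloud Storage" step,
# using the modern google-cloud-bigquery client. All names are made up.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="example-project")

# Export a table to GCS as newline-delimited JSON for the compute clusters.
job = client.extract_table(
    "example-project.warehouse.orders",
    "gs://example-bucket/exports/orders-*.json",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    ),
)
job.result()  # block until the export job finishes
```

The exported files land in GCS, where the compute clusters pick them up via gsutil or the GS REST API.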

Page 16: The Big Bad Data

Real-time processing

[Data-flow diagram: the INPUT STREAM (Kinesis) feeds an EVENT QUEUE; Event Processors on clusters A and B consult the event type descriptor (schema and processing instructions) in the Event Registry, splitting the raw events stream from the validated events stream; a BQ Sink writes to BQ and a GS Sink to Google Cloud Storage, leaving processed data ready for consumption by other systems; access is via the BQ Tableau Connector, BQ Excel Connector, BQ REST API, and the GS REST API / gsutil]
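To make the BQ Sink box concrete, a minimal sketch of streaming validated events into BigQuery with today's client library; the table name, queue interface, and batching policy are all assumptions:

```python
# Hypothetical BQ Sink: stream small batches of validated events into BigQuery
# so they become queryable within seconds. Table name and batch size are made up.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="example-project")
TABLE = "example-project.events.validated"

def flush(batch):
    errors = client.insert_rows_json(TABLE, batch)  # streaming insert
    if errors:
        print(f"insert errors: {errors}")  # real code would retry or alert

def bq_sink(event_queue, batch_size=500):
    """Drain an iterable of validated event dicts into BigQuery."""
    batch = []
    for event in event_queue:  # assumed: an iterable of already-validated dicts
        batch.append(event)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # don't lose the final partial batch
```

Streaming inserts are what make rows queryable within seconds, in contrast to the nightly batches the old architecture depended on.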

Page 17: The Big Bad Data

Questions?

