The Big Bad Data, or how to deal with data sources
Przemysław Pastuszka, Kraków, 17.10.2014


This presentation shows how we started doing Big Data at Ocado, what obstacles we hit, and how we tried to fix them later. You'll see how to deal with data sources, or most importantly, how not to deal with them.


Page 1: The Big Bad Data

The Big Bad Data

or how to deal with data sources

Przemysław Pastuszka, Kraków, 17.10.2014

Page 2: The Big Bad Data

What do I want to talk about?

● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)

Page 3: The Big Bad Data

Ocado intro

Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households and shipping over 150,000 orders a week, or 1.1M items a day.

Page 4: The Big Bad Data

Customer Fulfilment Center

Page 5: The Big Bad Data

Shop

Page 6: The Big Bad Data

How did we start?

[Diagram: Ocado services (Oracle, Greenplum, JMS) export raw data to Google Cloud Storage, where it is transformed into ORC files; Google BigQuery, a compute cluster and a user cluster consume the data, coordinated by a cluster manager.]
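A minimal sketch of the kind of overnight load this setup relied on, assuming the google-cloud-bigquery Python client; the bucket, dataset and table names are made up for illustration, not Ocado's.

# Hedged sketch: overnight batch load from Google Cloud Storage
# into BigQuery. All names are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assume the export ships a header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://example-exports/orders/2014-10-17/*.csv.gz",  # hypothetical path
    "example_dataset.orders_raw",                       # hypothetical table
    job_config=job_config,
)
load_job.result()  # blocks until the load finishes, raises on failure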

Page 7: The Big Bad Data

Looks good. So what’s the problem?

● Various data formats
  ○ json, csv, uncompressed, gzip, blob, nested…
  ○ incremental data, “snapshot” data, deltas
  ○ lots of code to handle all corner cases
● Corrupted data (see the defensive-parsing sketch below)
  ○ corrupted gzips, empty files, invalid content, unexpected schema changes…
  ○ failures of overnight jobs
● Data exports delayed
  ○ DB downtime, network issues, human error, …
  ○ data available in the BD platform even later
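To give a flavour of the corner-case handling this implies, here is a hedged Python sketch that reads one gzipped JSON-lines export defensively; the quarantine mechanism is an assumption for illustration, not our actual code.

import gzip
import json

def read_export(path, quarantine):
    """Parse one gzipped JSON-lines export; never let one bad file
    kill the whole overnight job."""
    records = []
    try:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                if not line.strip():
                    continue  # tolerate blank lines and empty files
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    quarantine.append((path, lineno, line))  # invalid content
    except (OSError, EOFError):
        quarantine.append((path, None, "corrupted or unreadable gzip"))
    return records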

Page 8: The Big Bad Data

Missing data

Page 9: The Big Bad Data

That’s not all...

● Real-time analytics?
  ○ data comes in batches every night
  ○ so forget it
● ORC is not a dream solution
  ○ not a very friendly format
  ○ the overnight transform is another point of failure
  ○ data duplication (raw + ORC)
● People think you “own” the data
  ○ “please, tell me what this data means?”

Page 10: The Big Bad Data

People get frustrated

● The Big Data team is frustrated
  ○ we spend lots of time on monitoring and fixing bugs
  ○ code becomes complex to handle corner cases
  ○ confidence in platform stability rapidly goes down
● Analysts are frustrated
  ○ long latency before data is available for querying
  ○ data is unreliable

Page 11: The Big Bad Data

It can’t go on like this anymore

Let’s go back to the board!

Page 12: The Big Bad Data

What do we need?

● Unified input data format
  ○ JSON to the rescue
● Data goes directly from applications to the BD platform
  ○ let’s make all external services write to a distributed queue
● Data validation and monitoring
  ○ all data coming into the system should be well-described
  ○ validation should happen as early as possible (sketch below)
  ○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
  ○ let’s push data to BigQuery first
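One way to make "validation as early as possible" concrete is a schema check at the point of ingestion. A minimal sketch using the jsonschema package; the event type and schema here are invented for illustration.

from jsonschema import ValidationError, validate

# Hypothetical descriptor for one event type; real schemas would
# live in the Event Registry.
ORDER_PLACED_SCHEMA = {
    "type": "object",
    "properties": {
        "eventType": {"type": "string"},
        "orderId": {"type": "string"},
        "items": {"type": "integer", "minimum": 1},
    },
    "required": ["eventType", "orderId", "items"],
}

def is_valid(event):
    try:
        validate(instance=event, schema=ORDER_PLACED_SCHEMA)
        return True
    except ValidationError:
        return False  # route to the invalid events store and alert early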

Page 13: The Big Bad Data

New architecture overview

[Diagram: an INPUT STREAM feeds a pool of Event Processors, which consult the Event Registry and write to Data Storage; endpoints and the Compute Cloud consume from there, coordinated by a Cluster Manager.]

Page 14: The Big Bad Data

Loading data

[Diagram: raw events arrive on the INPUT STREAM (Kinesis); an Event Processor validates them against event type descriptors (schema + processing instructions) held in the Event Registry; the validated events stream lands in BQ and Google Cloud Storage, while invalid events go to an invalid events store location (BQ) for ad hoc events replay; data is consumed via the BQ Tableau Connector, BQ Excel Connector, BQ REST API and the GS REST API / gsutil, with ad-hoc / scheduled exports.]
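A hedged sketch of the routing step an Event Processor performs here: validate each raw event against its descriptor, stream valid ones into BigQuery, and park the rest in the invalid events store. The registry interface and table names are assumptions, not Ocado's API.

from google.cloud import bigquery

client = bigquery.Client()

def process_batch(raw_events, registry):
    # registry.lookup / descriptor.validates are hypothetical
    # stand-ins for the Event Registry interface.
    valid, invalid = [], []
    for event in raw_events:
        descriptor = registry.lookup(event.get("eventType"))
        if descriptor is not None and descriptor.validates(event):
            valid.append(event)
        else:
            invalid.append(event)
    if valid:
        # streaming insert keeps data queryable within seconds
        errors = client.insert_rows_json("example_dataset.events", valid)
        if errors:
            raise RuntimeError(errors)
    if invalid:
        client.insert_rows_json("example_dataset.invalid_events", invalid)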

Page 15: The Big Bad Data

Batch processing

[Diagram: the COMPUTE CLOUD runs Compute Clusters A, B and C against BQ, via BQ query and export through Google Cloud Storage; results are reachable through the BQ Tableau Connector, BQ Excel Connector, BQ REST API and the GS REST API / gsutil.]
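A minimal sketch of this batch path with the google-cloud-bigquery client: materialise a query result into a staging table, then export it to Google Cloud Storage for a compute cluster to fetch. The query, dataset and bucket names are invented for illustration.

from google.cloud import bigquery

client = bigquery.Client()

# materialise the query result into a staging table (illustrative names)
job_config = bigquery.QueryJobConfig(
    destination="example-project.example_dataset.daily_item_counts"
)
client.query(
    """
    SELECT orderId, COUNT(*) AS items
    FROM example_dataset.events
    GROUP BY orderId
    """,
    job_config=job_config,
).result()

# export the staging table to GCS, where gsutil / the GS REST API serve it
client.extract_table(
    "example-project.example_dataset.daily_item_counts",
    "gs://example-exports/daily_item_counts/part-*.csv",
).result()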

Page 16: The Big Bad Data

Real-time processing

[Diagram: raw events from the INPUT STREAM (Kinesis) are validated by an Event Processor against event type descriptors (schema + processing instructions) from the Event Registry; the validated events stream feeds an EVENT QUEUE consumed by Clusters A and B, with a BQ Sink writing to BQ and a GS Sink writing to Google Cloud Storage (processed data ready for consumption by other…); access is via the BQ Tableau Connector, BQ Excel Connector, BQ REST API and the GS REST API / gsutil.]
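A hedged sketch of what a GS Sink might look like: buffer processed events and flush them to Google Cloud Storage as gzipped JSON lines. The bucket name, batch size and file layout are assumptions for illustration.

import gzip
import json

from google.cloud import storage

class GsSink:
    """Buffer processed events, flush to GCS as gzipped JSON lines."""

    def __init__(self, bucket_name="example-processed", batch_size=1000):
        self._bucket = storage.Client().bucket(bucket_name)
        self._batch_size = batch_size
        self._buffer = []
        self._part = 0

    def write(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        payload = "\n".join(json.dumps(e) for e in self._buffer)
        blob = self._bucket.blob("events/part-%05d.json.gz" % self._part)
        blob.upload_from_string(gzip.compress(payload.encode("utf-8")))
        self._buffer = []
        self._part += 1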

Page 17: The Big Bad Data

Questions?