The Big Bad Data


DESCRIPTION

The presentation shows how we started doing Big Data at Ocado, what obstacles we hit and how we tried to fix them later. You'll see how to deal with data sources, or most importantly, how not to deal with them.


The Big Bad Data

or how to deal with data sources

Przemysław Pastuszka, Kraków, 17.10.2014

What do I want to talk about?

● Quick introduction to Ocado
● First approach to Big Data and why we ended up with Bad Data
● Making things better - Unified data architecture
● Live demo (if there is time left)

Ocado intro

Ocado is the world's largest online-only grocery retailer, reaching over 70% of British households, shipping over 150,000 orders a week or 1.1M items a day.

Customer Fulfilment Center

Shop

How did we start?

[Architecture diagram: Ocado services (Oracle, Greenplum, JMS) export raw data to Google Cloud Storage; a compute cluster and a user cluster, run by a Cluster Manager, transform the raw data overnight into ORC files, with Google BigQuery alongside as the query layer]
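The deck doesn't show the transform jobs themselves, but to make the overnight raw-to-ORC step concrete, here is a minimal sketch in PySpark; the bucket paths, the field name and even the choice of Spark are assumptions for illustration (reading gs:// paths also needs the GCS Hadoop connector on the cluster).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overnight-orc-transform").getOrCreate()

# Read one day's raw JSON export from Google Cloud Storage (hypothetical path),
# drop obviously broken rows, and rewrite the data as ORC for the clusters.
raw = spark.read.json("gs://bd-raw-data/orders/2014-10-16/*.json.gz")
cleaned = raw.filter(raw["orderId"].isNotNull()).dropDuplicates(["orderId"])
cleaned.write.mode("overwrite").orc("gs://bd-transformed-orc/orders/2014-10-16/")
```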

Looks good. So what’s the problem?

Missing data

● Various data formats
  ○ json, csv, uncompressed, gzip, blob, nested…
  ○ incremental data, “snapshot” data, deltas
  ○ lots of code to handle all corner cases (see the sketch after this list)
● Corrupted data
  ○ corrupted gzips, empty files, invalid content, unexpected schema changes…
  ○ failures of overnight jobs
● Data exports delayed
  ○ DB downtime, network issues, human error, …
  ○ data available in the BD platform even later
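To make the “lots of code to handle all corner cases” point concrete, here is a hypothetical ingestion helper in the spirit of the defensive code this slide complains about; the function, paths and log messages are illustrative, not taken from the actual codebase.

```python
import gzip
import json
import logging

log = logging.getLogger("ingest")

def read_export(path):
    """Read one nightly export file, working around the corner cases above."""
    records = []
    try:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            raw = fh.read()
    except (OSError, EOFError):
        # Corrupted gzip: skip the file and let monitoring raise the alert.
        log.warning("corrupted gzip, skipping: %s", path)
        return records
    if not raw.strip():
        # Empty file: a "successful" export that contains nothing.
        log.warning("empty export, skipping: %s", path)
        return records
    for line in raw.splitlines():
        try:
            records.append(json.loads(line))
        except ValueError:
            # Invalid content or an unexpected schema change in this source.
            log.warning("unparseable line in %s", path)
    return records
```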

That’s not all...

● Real-time analytics?
  ○ data comes in batches every night
  ○ so forget it
● ORC is not a dream solution
  ○ not a very friendly format
  ○ the overnight transform is another point of failure
  ○ data duplication (raw + ORC)
● People think you “own” the data
  ○ “please, tell me what this data means?”

People get frustrated

● The Big Data team is frustrated
  ○ we spend lots of time on monitoring and fixing bugs
  ○ code becomes complex to handle corner cases
  ○ confidence in platform stability rapidly goes down
● Analysts are frustrated
  ○ long latency before data is available for querying
  ○ data is unreliable

Let’s go back to the board!

It can’t go on like this anymore

What do we need?

● Unified input data format
  ○ JSON to the rescue
● Data goes directly from applications to the BD platform
  ○ let’s make all external services write to a distributed queue (a publisher sketch follows this list)
● Data validation and monitoring
  ○ all data coming into the system should be well-described
  ○ validation should happen as early as possible
  ○ alerts on corrupted / missing data must be raised early
● Data must be ready for querying early
  ○ let’s push data to BigQuery first
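As a sketch of the “write to a distributed queue” idea, this is roughly what a service-side publisher could look like with boto3 and Kinesis (the stream the later slides name as the input); the stream name, envelope fields and example event are assumptions, not the actual schema.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

def publish_event(event_type, payload, stream_name="bd-input-stream"):
    """Publish one JSON event from an application straight to the input stream."""
    # The envelope (type, id, timestamp) is what lets the platform validate
    # and route the event later; the field names here are illustrative.
    event = {
        "type": event_type,
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["id"],
    )

# e.g. from an order service:
# publish_event("order.dispatched", {"orderId": 123, "items": 57})
```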

New architecture overview

[Architecture diagram: an input stream feeds a set of Event Processors, which consult the Event Registry and write to Data Storage; a Compute Cloud run by a Cluster Manager and the query endpoints sit on top of that storage]

Loading data

[Diagram: events arrive on the input stream (Kinesis); for each event type the Event Registry holds an event type descriptor (schema plus processing instructions); the Event Processor writes a raw events stream and a validated events stream into BigQuery and routes invalid events to an invalid events store (location in BQ), from which ad-hoc event replay is possible; ad-hoc or scheduled exports go to Google Cloud Storage, and the data is consumed through the BQ Tableau Connector, BQ Excel Connector, BQ REST API and the GS REST API / gsutil]
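A minimal sketch of what the Event Processor's validation step might boil down to, assuming the event type descriptor in the Event Registry carries a JSON Schema as the diagram suggests; the registry contents, table name and sink objects are hypothetical stand-ins.

```python
import json

from jsonschema import Draft7Validator

# Hypothetical in-memory stand-in for the Event Registry: one event type
# descriptor per type, holding a schema and processing instructions.
EVENT_REGISTRY = {
    "order.dispatched": {
        "schema": {
            "type": "object",
            "required": ["orderId", "items"],
            "properties": {
                "orderId": {"type": "integer"},
                "items": {"type": "integer"},
            },
        },
        "target_table": "events.order_dispatched",
    },
}

def process_event(raw_record, validated_sink, invalid_store):
    """Validate one raw event against its descriptor and route it."""
    event = json.loads(raw_record)
    descriptor = EVENT_REGISTRY.get(event.get("type"))
    if descriptor is None:
        invalid_store.append(("unknown event type", event))
        return
    errors = list(
        Draft7Validator(descriptor["schema"]).iter_errors(event.get("payload", {}))
    )
    if errors:
        # Kept around so the event can be fixed and replayed ad hoc.
        invalid_store.append((errors[0].message, event))
    else:
        # Validated events stream, headed for BigQuery.
        validated_sink.append((descriptor["target_table"], event))
```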

Batch processing

[Diagram: batch jobs run BigQuery queries and export the results to Google Cloud Storage; Compute Clusters A, B and C in the Compute Cloud pick the exports up via the GS REST API / gsutil, while BigQuery itself is reachable through the BQ Tableau Connector, BQ Excel Connector and BQ REST API]
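For illustration, this is one way the query-then-export flow could be driven with today's google-cloud-bigquery client (in 2014 the same steps would have gone through the BQ REST API directly); the project, dataset, table and bucket names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()
destination = bigquery.TableReference.from_string("my-project.staging.daily_basket_sizes")

# 1. Run the query, materialising the result into a staging table.
job_config = bigquery.QueryJobConfig(
    destination=destination, write_disposition="WRITE_TRUNCATE"
)
client.query(
    "SELECT customer_id, AVG(items) AS avg_items "
    "FROM `my-project.events.order_dispatched` GROUP BY customer_id",
    job_config=job_config,
).result()

# 2. Export the staging table to Google Cloud Storage, where the compute
#    clusters pick it up via the GS REST API / gsutil.
client.extract_table(
    destination, "gs://my-bd-exports/daily_basket_sizes-*.csv"
).result()
```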

Real-time processing

[Diagram: as in the loading flow, events arrive on the Kinesis input stream and are checked against the Event Registry's event type descriptors (schema, processing instructions); the raw and validated events streams feed BigQuery, and validated events are also pushed onto an event queue; Event Processors on clusters A and B consume the queue and write results through a BQ Sink and a GS Sink (Google Cloud Storage), leaving processed data ready for consumption via the BQ Tableau / Excel connectors, the BQ REST API and the GS REST API / gsutil]
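A minimal sketch of what the BQ Sink step might look like, using BigQuery streaming inserts so that rows become queryable within seconds rather than after a nightly batch; the table name and event shape are assumptions carried over from the earlier sketches.

```python
from google.cloud import bigquery

client = bigquery.Client()

def bq_sink(validated_events, table_id="my-project.events.order_dispatched"):
    """Stream a micro-batch of validated events straight into BigQuery."""
    rows = [event["payload"] for event in validated_events]
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        # Rows that fail here would go back to the invalid events store
        # so they can be inspected and replayed ad hoc.
        raise RuntimeError("BQ streaming insert failed: %s" % errors)
```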

Questions?
