Data Architectures for Robust Decision Making


Designing Data Architectures for Robust Decision Making

Gwen Shapira / Software Engineer


About Me

• 15 years of moving data around

• Formerly consultant

• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume

• @gwenshap


There’s a book on that!


About you:

You know Hadoop

“Big Data” is stuck at The Lab.


We want to move to The Factory


What does it mean to “Systemize”?

• Ability to easily add new data sources

• Easily improve and expand analytics

• Ease data access by standardizing metadata and storage

• Ability to discover mistakes and to recover from them

• Ability to safely experiment with new approaches


We will not discuss:

• Actual decision making

• Data Science

• Machine learning

• Algorithms

We will discuss:

• Architectures

• Patterns

• Ingest

• Storage

• Schemas

• Metadata

• Streaming

• Experimenting

• Recovery


So how do we build real data architectures?


The Data Bus


Data pipelines start like this:

Client → Source


Then we reuse them:

Client, Client, Client, Client → Source


Then we add consumers to the existing sources:

Client, Client, Client, Client → Backend, Another Backend


Then it starts to look like this:

Client, Client, Client, Client → Backend, Another Backend, Another Backend, Another Backend


With maybe some of this:

Client, Client, Client, Client ↔ Backend, Another Backend, Another Backend, Another Backend


Adding applications should be easier

We need:

• Shared infrastructure for sending records

• Infrastructure must scale

• Set of agreed-upon record schemas


Kafka Based Ingest Architecture


Kafka decouples Data Pipelines

Source System, Source System, Source System, Source System
→ Producers → Kafka Brokers → Consumers →
Hadoop, Security Systems, Real-time Monitoring, Data Warehouse
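As a rough sketch of what a producer on this bus can look like (the broker addresses, topic name, key, and payload below are placeholders, not from the slides), using the standard kafka-clients producer API:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object OrderProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")  // illustrative broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Every source system publishes to a shared, agreed-upon topic
    // instead of being wired directly to each downstream consumer
    producer.send(new ProducerRecord[String, String]("pharmacy.fraud.orders.raw", "order-42", "{...}"))
    producer.close()
  }
}

Downstream systems then subscribe to the topic; adding another consumer no longer touches the sources.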


Retain All Data


Data Pipeline – Traditional View

Input (raw data) → clean data → enriched data → aggregated data → Output

Everything between input and output is treated as a waste of disk space.


It is all valuable data

Raw data → clean data → enriched data → aggregated data, plus filtered data along the way. Each stage feeds someone: reports, dashboards, data scientists, alerts ("OMG").


Hadoop Based ETL – The FileSystem is the DB

/user/…

/user/gshapira/testdata/orders

/data/<database>/<table>/<partition>

/data/<biz unit>/<app>/<dataset>/partition

/data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>

/etl/pharmacy/fraud/orders/validated


Store intermediate data

/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>

/etl/pharmacy/fraud/orders/raw/date=20131101

/etl/pharmacy/fraud/orders/deduped/date=20131101

/etl/pharmacy/fraud/orders/validated/date=20131101

/etl/pharmacy/fraud/orders_labs/merged/date=20131101

/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101

/etl/pharmacy/fraud/orders_labs/ranked/date=20131101
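A tiny illustrative helper (hypothetical, not from the slides) that builds these staged paths consistently:

def etlPath(bizUnit: String, app: String, dataset: String, stage: String, date: String): String =
  s"/etl/$bizUnit/$app/$dataset/$stage/date=$date"

// etlPath("pharmacy", "fraud", "orders", "deduped", "20131101")
//   -> /etl/pharmacy/fraud/orders/deduped/date=20131101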


Batch ETL is old news


Small Problem!

• HDFS is optimized for large chunks of data

• Don’t write individual events or micro-batches

• Think 100 MB–2 GB batches

• What do we do with small events?


Well, we have this data bus…


A Kafka topic: Partition 1, Partition 2, Partition 3 – each an ordered, append-only log of records numbered by offset (0, 1, 2, …). Writes go to the new end; older records stay retained behind them.


Kafka has topics

How about?

<biz unit>.<app>.<dataset>.<stage>

pharmacy.fraud.orders.raw

pharmacy.fraud.orders.deduped

pharmacy.fraud.orders.validated

pharmacy.fraud.orders_labs.merged
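The same convention as the HDFS layout, now as topic names; a hypothetical one-liner makes the mapping explicit:

def stageTopic(bizUnit: String, app: String, dataset: String, stage: String): String =
  s"$bizUnit.$app.$dataset.$stage"

// stageTopic("pharmacy", "fraud", "orders", "validated") -> "pharmacy.fraud.orders.validated"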


It’s (almost) all topics

Raw data → clean data → enriched data → aggregated data and filtered data, each stage its own topic; reports, dashboards, data scientists, and alerts consume from whichever stage they need ("OMG").


Benefits

• Recover from accidents

• Debug suspicious results

• Fix algorithm errors

• Experiment with new algorithms

• Expand pipelines

• Jump-start expanded pipelines


Kinda Lambda


Lambda Architecture

• Immutable events

• Store intermediate stages

• Combine Batches and Streams

• Reprocessing


What we don’t like

Maintaining two applications

Often in two languages

That do the same thing


Pain Avoidance #1 – Use Spark + Spark Streaming

• Spark is awesome for batch, so why not?
  – The New Kid that isn’t that New Anymore
  – Easily 10x less code
  – Extremely Easy and Powerful API
  – Very good for machine learning
  – Scala, Java, and Python
  – RDDs
  – DAG Engine


Spark Streaming

• Calling Spark in a Loop

• Extends RDDs with DStream

• Very Little Code Changes from ETL to Streaming


Spark Streaming

Source → Receiver → RDD: before the first batch, the receiver accumulates incoming records into an RDD. First batch: a single pass (Filter → Count → Print) runs over that RDD while the receiver builds the next one. Second batch: the same single pass runs over the newest RDD.


Small Example

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")  // updateStateByKey needs a checkpoint dir (path is illustrative)

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)

// Count the errors in each RDD in the stream
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))

// Running total of errors across batches
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)

// Print the count for the current batch
errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this batch: %d".format(rdd.first()._2))
})

ssc.start()
ssc.awaitTermination()
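The snippet refers to ErrorCount.countErrors and updateFunc, which the slide does not show. A minimal sketch of what they might look like (the "ERROR" keyword and the single aggregation key are assumptions):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object ErrorCount {
  // Count the lines containing "ERROR" in this batch, keyed so the result
  // can be folded into running state with updateStateByKey
  def countErrors(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.filter(_.contains("ERROR")).map(_ => ("ERROR", 1)).reduceByKey(_ + _)
}

// Add this batch's counts to the previous running total
val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
  (newCounts, state) => Some(newCounts.sum + state.getOrElse(0))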


Pain Avoidance #2 – Split the Stream

Why do we even need stream + batch?

• Batch efficiencies

• Re-process to fix errors

• Re-process after delayed arrival

What if we could re-play data?
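Kafka retains the data, so replay can be as simple as reading the topic again from the start. A sketch using the Kafka consumer API with a fresh consumer group (broker list, group name, and topic are illustrative):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")   // illustrative broker
props.put("group.id", "fraud-orders-v2")         // new group = no committed offsets yet
props.put("auto.offset.reset", "earliest")       // so we start from the oldest retained record
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("pharmacy.fraud.orders.validated"))

// Version 2 of the app re-reads everything Kafka still retains
// and writes its output to its own result set
while (true) {
  for (record <- consumer.poll(1000).asScala) {
    // apply the new algorithm to record.value() here
  }
}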


Let’s re-process with a new algorithm

The topic retains the records (offsets 0, 1, 2, …), so Streaming App v1 keeps producing Result set 1 while Streaming App v2 re-reads the same records with the new algorithm and produces Result set 2 for the consuming App.


Oh no, we just got a bunch of data for yesterday!

Because the topic retains the records, one instance of the streaming app keeps processing today’s data while a second instance replays the retained records to process yesterday’s late arrivals.


Note:

No need to choose between the approaches.

There are good reasons to do both.


Prediction:

Batch vs. Streaming distinction is going away.


Yes, you really need a Schema


Schema is a MUST HAVE for data integration


Client, Client, Client, Client ↔ Backend, Another Backend, Another Backend, Another Backend – the point-to-point tangle from before


Remember that we want this?


Source System, Source System, Source System, Source System
→ Producers → Kafka Brokers → Consumers →
Hadoop, Security Systems, Real-time Monitoring, Data Warehouse


This means we need this:


Source System, Source System, Source System, Source System
→ Kafka + Schema Repository →
Hadoop, Security Systems, Real-time Monitoring, Data Warehouse


We can do it in a few ways

• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”

• There’s utility code for reading/writing messages that everyone reuses

• Schema embedded in the message

• A centralized repository for schemas
  – Each message has a Schema ID
  – Each topic has a Schema ID
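One common way to carry the “schema ID per message” option is to frame every message as a small ID followed by the serialized payload; consumers use the ID to fetch the writer schema from the repository before decoding. A minimal sketch (the 4-byte layout is an assumption, not a specific product’s wire format):

import java.nio.ByteBuffer

// [4-byte schema ID][Avro-encoded payload]
def frame(schemaId: Int, avroBytes: Array[Byte]): Array[Byte] = {
  val buf = ByteBuffer.allocate(4 + avroBytes.length)
  buf.putInt(schemaId).put(avroBytes)
  buf.array()
}

def unframe(message: Array[Byte]): (Int, Array[Byte]) = {
  val buf = ByteBuffer.wrap(message)
  val schemaId = buf.getInt()
  val payload = new Array[Byte](buf.remaining())
  buf.get(payload)
  (schemaId, payload)
}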


I ♥ Avro

• Define Schema

• Generate code for objects

• Serialize / Deserialize into Bytes or JSON

• Embed schema in files / records… or not

• Support for our favorite languages… Except Go.

• Schema Evolution
  – Add and remove fields without breaking anything
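A small sketch of the Avro workflow with generic records (the Order schema and its fields are made up for illustration); the last step shows schema evolution, reading old bytes with a newer schema that adds a defaulted field:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Writer schema (v1): two fields
val schemaV1 = new Schema.Parser().parse(
  """{"type": "record", "name": "Order", "namespace": "pharmacy.fraud",
    | "fields": [{"name": "id", "type": "long"},
    |            {"name": "amount", "type": "double"}]}""".stripMargin)

// Serialize one record to bytes
val order = new GenericData.Record(schemaV1)
order.put("id", 42L)
order.put("amount", 19.99)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schemaV1).write(order, encoder)
encoder.flush()

// Reader schema (v2) adds a field with a default, so old bytes still deserialize
val schemaV2 = new Schema.Parser().parse(
  """{"type": "record", "name": "Order", "namespace": "pharmacy.fraud",
    | "fields": [{"name": "id", "type": "long"},
    |            {"name": "amount", "type": "double"},
    |            {"name": "channel", "type": "string", "default": "unknown"}]}""".stripMargin)
val reader = new GenericDatumReader[GenericRecord](schemaV1, schemaV2)
val decoded = reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray, null))
// decoded.get("channel") is "unknown" – the old record gained the new field without breaking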


Schemas are Agile

• Leave out MySQL and your favorite DBA for a second

• Schemas allow adding readers and writers easily

• Schemas allow modifying readers and writers independently

• Schemas can evolve as the system grows

• Allows validating data soon after it’s written
  – No need to throw away data that doesn’t fit!


Whoa, that was a lot of stuff!


Recap – if you remember nothing else…

• After the POC, it’s time for production

• Goal: Evolve fast without breaking things

For this you need:

• Keep all data

• Design pipeline for error recovery – batch or stream

• Integrate with a data bus

• And Schemas

Thank you
