51
Data Pipelines : Improving on the Lambda Architecture Brian O’Neill, CTO [email protected] m @boneill42

Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Embed Size (px)

DESCRIPTION

This presentation covers our use of Storm and the connectors we've built. It also proposes a design for integrating Storm with real-time web services by embedding parts of topologies directly into the web services layer.

Citation preview

Page 1: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Data Pipelines : Improving on the Lambda Architecture

Brian O’Neill, [email protected]@boneill42

Page 2: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Talk Breakdown

29%

20%31%

20%

Topics

(1) Motivation(2) Polyglot Persistence(3) Analytics(4) Lambda Architecture

Page 3: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Health Market Science - Then

What we were.

Page 4: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Health Market Science - Now

Intersecting Big Data w/ Healthcare

We’re fixing healthcare!Dashboards

AnalyticsData Management

Web Services

(api.hmsonline.com)

Page 5: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Data Pipelines

I/O

Page 6: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The InputFrom government,state boards, etc.

From the internet,social data,networks / graphs

From third-parties,medical claims

From customers,expenses,sales data,beneficiary

information,quality scores

DataPipeline> 2,000+ so

urces

> 1B events / year

Page 7: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The Output

Script

Claims

Expense

Sanction

Address

Contact(phone, fax, etc.)

Drug

RepresentativeDivision

Expense ManagerTM

Provider Verification™

MarketViewTM

CustomerFeed(s)

CustomerMaster

Provider MasterFileTM

Credentials

“Agile MDM”1 billion claims

per year

Organization

Practitioner

Referrals

> 8M practitioners

> 5B claims

> 5 years of histo

ry

Page 8: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Sounds easy

Except...Incomplete CaptureNo foreign keysDiffering schemasChanging schemasConflicting informationAd-hoc Analysis (is hard)Point-In-Time Retrieval

Page 9: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Golden

Record

Master Data Management

Harvested

Government

Private

Schema Change!

Page 10: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Why?

?’s

Is this doctor,Licensed?Sanctioned?Influential?

Is this e

xpense,

Legal?

Compliant?

Reported?

Is this claim,

Fraudulent?

Wasteful?

Abusive?

Compliance & Safety

Sales Operations

Cost Control

Transparency

Is this market,

Saturated?

Penetrable?Marketing Optimization

Page 11: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Our MDM Pipeline

- Data Stewardship- Data Scientists- Business Analysts

Ingestion- Semantic Tagging- Standardization- Data Mapping

Incorporation- Consolidation- Enumeration- Association

Insight- Search- Reports- Analytics

Feeds(multiple formats,

changing over time)API / FTP Web Interface

DimensionsLogicRules

Page 12: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Our first “Pipeline”

+

Page 13: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Sweet!

Dirt SimpleLightning Fast

Highly AvailableScalable

Multi-Datacenter (DR)

Page 14: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Not Sweet.

How do we query the data?NoSQL Indexes?

Do such things exist?

Page 15: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Rev. 1 – Wide Rows!

AOPTriggers!Data model to

support your queries.

9 7 32 74 99 12 42

$3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88

ONC : PA : 19460

D’Oh! What about ad hoc?

Page 16: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Transformation

Rev 2 – Elastic Search!

AOPTriggers!

D’Oh!What if ES fails?What about schema / type

information?

Page 17: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Rev 3 - Apache Storm!

Page 18: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Polyglot Persistence“The Right Tool for the Job”

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Page 19: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Back to the Pipeline

KafkaDW

Storm

C* ES Titan SQL

Page 20: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Design Principles

• What we got:– At-least-once processing– Simple data flows

• What we needed to account for:– Replays

Idempotent Operations!Immutable Data! FTW!

Page 22: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Trident Elastic Search (v0.3.1)

[email protected]:hmsonline/trident-elasticsearch.git

{tuple} <mapper> (idx, docid, k:v[])Storm Elastic Search

Page 23: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Storm Graph (v0.1.2)

Coming soon [email protected]:hmsonline/storm-graph.git

for (tuple : batch)<processor> (graph, tuple)

Page 24: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Storm JDBI (v0.1.14)

INTERNAL ONLY (so far)Worth releasing?

{tuple} <mapper> (JDBC Statement)

Page 25: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

All good!

Page 26: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

But...

What was the average amount for a medical claim associated with procedure X by zip code over the last five years?

Page 27: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Hadoop (<2)? Batch?

Yuck. ‘Nuff Said.http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186

Page 28: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Let’s Pre-Compute It!

stream.groupBy(new Field(“ICD9”)).groupBy(new Field(“zip”)).aggregate(new Field(“amount”),

new Average())

D’Oh! GroupBy’s.They set data in motion!

Page 29: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Lesson Learned

https://github.com/nathanmarz/storm/wiki/Trident-API-Overview

If possible, avoid re-partitioning operations!

(e.g. LOG.error!)

Page 30: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Why so hard?

D’Oh!

19 != 9

What we don’t want:LOCKS!

What’s the alternative?CONSENSUS!

Page 32: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Conditional Updates

“The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again.”http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0

UPDATE value=9 WHERE word=“fox” IF value=6

Page 33: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Love CQL

Conditional Updates+

Batch Statements+

Collections=

BADASS DATA MODELS

Page 34: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Announcing : Storm Cassandra CQL!

[email protected]:hmsonline/storm-cassandra-cql.git

{tuple} <mapper> (CQL Statement)

Trident Batching =? CQL Batching

Page 35: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

CassandraCqlStatepublic void commit(Long txid) {

BatchStatement batch = new BatchStatement(Type.LOGGED); batch.addAll(this.statements); clientFactory.getSession().execute(batch); }

public void addStatement(Statement statement) { this.statements.add(statement); } public ResultSet execute(Statement statement){ return clientFactory.getSession().execute(statement); }

Page 36: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

CassandraCqlStateUpdater

public void updateState(CassandraCqlState state, List<TridentTuple> tuples, TridentCollector collector) {

for (TridentTuple tuple : tuples) { Statement statement = this.mapper.map(tuple); state.addStatement(statement); } }

Page 37: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

ExampleMapperpublic Statement map(List<String> keys, Number value) {

Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);statement.value(KEY_NAME, keys.get(0));statement.value(VALUE_NAME, value);return statement;

}

public Statement retrieve(List<String> keys) {Select statement = QueryBuilder.select().column(KEY_NAME).column(VALUE_NAME).from(KEYSPACE_NAME, TABLE_NAME).where(QueryBuilder.eq(KEY_NAME, keys.get(0)));

return statement;}

Page 38: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Incremental State!

• Collapse aggregation into the state object.– This allows the state object to aggregate with current state in a loop

until success.

• Uses Trident Batching to perform in-memory aggregation for the batch.

for (tuple : batch)state.aggregate(tuple);

while (failed?) {persisted_state = read(state)aggregate(in_memory_state, persisted_state)failed? = conditionally_update(state)}

Page 39: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Partition 1

In-Memory Aggregation by Key!

Key Value

fox 6

brown 3

Partition 2

Key Value

fox 3

lazy 72C*

No More GroupBy!

Page 40: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

To protect against replays

Use partition + batch identifier(s) in your conditional update!

“BatchId + partitionIndex consistently represents the same data as long as:

1.Any repartitioning you do is deterministic (so partitionBy is, but shuffle is not)

2.You're using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transactional spouts)”

- Nathan Marz

Page 41: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The Lambda Architecture

http://architects.dzone.com/articles/nathan-marzs-lamda

Page 42: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Let’s Challenge This a Bit

because “additional tools and techniques” cost money and time.

• Questions:– Can we solve the problem with a single tool and a

single approach?– Can we re-use logic across layers?– Or better yet, can we collapse layers?

Page 43: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

A Traditional Interpretation

Speed Layer(Storm)

Batch Layer(Hadoop)

DataStream

Serving Layer

HBase

Impala

D’Oh! Two pipelines!

Page 44: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Integrating Web Services

• We need a web service that receives an event and provides,– an immediate acknowledgement– a high likelihood that the data is integrated very soon– a guarantee that the data will be integrated

eventually

• We need an architecture that provides for,– Code / Logic and approach re-use– Fault-Tolerance

Page 45: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Grand Finale

Page 46: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The Idea : Embedding State!

KafkaDropWizard

C*

IncrementalCqlStateaggregate(tuple)

“Batch” Layer(Storm)

Client

Page 47: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The Sequence of Events

Page 48: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

The Wins

• Reuse Aggregations and State Code!• To re-compute (or backfill) a dimension,

simply re-queue!• Storm is the “safety” net

– If a DW host fails during aggregation, Storm will fill in the gaps for all ACK’d events.

• Is there an opportunity to reuse more?– BatchingStrategy & PartitionStrategy?

Page 49: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

In the end, all good. =)

Page 51: Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture

Thanks

Brian O’Neill, [email protected]@boneill42