Lambda Architecture Analyzing large scale, unstructured, dynamic data Rajesh Muppalla (@codingnirvana) [email protected]

Lambda architecture @ Indix

DESCRIPTION

Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014. They cover our experience using Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.

Page 1: Lambda architecture @ Indix

Lambda Architecture

Analyzing large scale, unstructured, dynamic data

Rajesh Muppalla (@codingnirvana) [email protected]

Page 2: Lambda architecture @ Indix

Indix - Quick Overview

Am I priced higher or lower w.r.t. my competitor on Nikon D700?

Which product has the UPC - 8745354434?

What are all the variants of Apple Macbook Air 13”?

What is the average price change of all Nike Shoes in Walmart in the last 3 months?

Page 3: Lambda architecture @ Indix

Data Pipeline @ Indix

[Diagram: Crawling → Parsing → Classification → Matching → Product & Price Catalog, with ML Model boxes in the pipeline and items grouped under labels C1 and C2]

Page 4: Lambda architecture @ Indix

Data Pipeline @ Indix

Analytics (Precomputes, Insights)

Search Index

Product & Price Catalog

Experiences

We released v1.0 of our API today: developer.indix.com

Page 5: Lambda architecture @ Indix

Data is Dynamic

[Diagram: the same pipeline (Crawling → Parsing → Classification → Matching) with an ML Model and a new ML Model, and items grouped under labels C1 and C2]

Page 6: Lambda architecture @ Indix

Data Scale

● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites

Page 7: Lambda architecture @ Indix

Data Pipeline v1.0

Page 8: Lambda architecture @ Indix

Batch using HBase & MapReduce

Page 9: Lambda architecture @ Indix

Problem 1

Data Systems should be Human Fault Tolerant

Mutable State

Page 10: Lambda architecture @ Indix

Problem 2

Compactions

Random Write databases are hard to manage at large scale

Page 11: Lambda architecture @ Indix

Problem 3

16 hours

A latency of 16 hours is a lot. We wanted it to be a couple of hours.

Page 12: Lambda architecture @ Indix

Three Problems

● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff

Page 13: Lambda architecture @ Indix

Rethink our data systems

Page 14: Lambda architecture @ Indix

Lambda Architecture

Page 15: Lambda architecture @ Indix

Lambda Architecture

● An approach to build big data systems
  ○ Architectural Components & Principles
  ○ Ties Batch & Real Time Systems
  ○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
  ○ Ex-Twitter Engineer
  ○ Creator of Storm

Page 16: Lambda architecture @ Indix

Data System - Traditional Approach

[Diagram: Application and HBase, with HBase as the Source of Truth]

Page 17: Lambda architecture @ Indix

Data System - New Approach

[Diagram: Immutable Raw Data (Source of Truth), Processed View(s), Application]

Page 18: Lambda architecture @ Indix

Let’s take an example

Find the count of unique products in any given category for the entire time range
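To make the example concrete, here is a minimal sketch in plain Python, not the Hadoop jobs used at Indix. The record layout `(timestamp, category, product_id)` and the sample values are assumptions for illustration. The point is that the view is a pure function of append-only raw data, so it can always be rebuilt from scratch.

```python
from collections import defaultdict

# Hypothetical raw records: (timestamp, category, product_id).
# Records are only ever appended, never updated in place.
RAW_EVENTS = [
    ("2014-07-01T09:00", "C1", "p1"),
    ("2014-07-01T09:05", "C1", "p2"),
    ("2014-07-01T10:00", "C2", "p3"),
    ("2014-07-01T11:00", "C1", "p1"),  # the same product seen again
]


def unique_products_per_category(events):
    """Recompute the view from scratch: category -> number of distinct products."""
    seen = defaultdict(set)
    for _timestamp, category, product_id in events:
        seen[category].add(product_id)
    return {category: len(products) for category, products in seen.items()}


view = unique_products_per_category(RAW_EVENTS)
print(view)  # {'C1': 2, 'C2': 1}
```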

Page 19: Lambda architecture @ Indix

Two Requirements

● Recomputations
● Large Scale

Page 20: Lambda architecture @ Indix

Batch Layer Implementation

[Diagram: New Data is added to the Products Master Data on HDFS (vertical partitioning into hourly partitions, 9 am to 2 pm). MR Job 1 and MR Job 2 produce an Intermediate view (C1 to C5) and a Batch View in HBase (C1: 5, C2: 7, C3: 4, C4: 7, C5: 1), which answers the Query.]
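The slide names two MapReduce jobs, an intermediate view and a batch view. A rough in-memory sketch of that shape follows. The split of work between the two jobs (job 1 collects distinct products per category, job 2 reduces them to counts) is an assumption, as are the function names and sample data.

```python
from collections import defaultdict

# Hypothetical hourly partitions of master data: hour -> [(category, product_id)].
PARTITIONS = {
    "9 am": [("C1", "p1"), ("C2", "p7")],
    "10 am": [("C1", "p2"), ("C1", "p1")],
    "11 am": [("C2", "p8"), ("C3", "p9")],
}


def job1_intermediate_view(partitions):
    """First pass (assumed): collect the distinct product ids seen per category."""
    intermediate = defaultdict(set)
    for records in partitions.values():
        for category, product_id in records:
            intermediate[category].add(product_id)
    return intermediate


def job2_batch_view(intermediate):
    """Second pass (assumed): reduce each category's set to a count for serving."""
    return {category: len(products) for category, products in sorted(intermediate.items())}


def add_partition(partitions, hour, records):
    """New data arrives as a new partition; existing partitions are not modified."""
    updated = dict(partitions)
    updated[hour] = list(records)
    return updated


batch_view = job2_batch_view(job1_intermediate_view(PARTITIONS))
print(batch_view)  # {'C1': 2, 'C2': 2, 'C3': 1}
```

Adding a partition and rerunning both jobs yields the next batch view; nothing in the existing partitions is touched.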

Page 21: Lambda architecture @ Indix

Handling Recomputations

[Diagram: the batch layer pipeline from the previous slide, repeated: Products Master Data on HDFS (vertical partitioning, hourly partitions 9 am to 2 pm), New Data, MR Job 1, Intermediate view (C1 to C5), MR Job 2, Batch View in HBase (C1: 5, C2: 7, C3: 4, C4: 7, C5: 1), Query]

Page 22: Lambda architecture @ Indix

Handling Scale

● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability

Page 23: Lambda architecture @ Indix

Three Problems (Recap)

● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff

Page 24: Lambda architecture @ Indix

Human Fault Tolerance

● Bugs in the batch jobs
  ○ Discard views & Recompute
● Bugs in the master data jobs
  ○ Re-process the master data to hide the old data
● Bugs in the query
  ○ Re-deploy the query layer
● Traceability as a side effect
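The first bullet ("discard views & recompute") can be sketched as follows. This is an illustration under assumptions: `buggy_job`, `fixed_job` and the sample records are hypothetical, and the real jobs run on Hadoop. Because the raw data is never mutated, a bad view is simply thrown away and rebuilt.

```python
# Hypothetical immutable raw data: (category, product_id) sightings.
RAW = (("C1", "p1"), ("C1", "p1"), ("C1", "p2"), ("C2", "p3"))


def buggy_job(records):
    """Bug: counts every sighting, so repeated products inflate the count."""
    counts = {}
    for category, _product_id in records:
        counts[category] = counts.get(category, 0) + 1
    return counts


def fixed_job(records):
    """Fix: count distinct products per category."""
    distinct = {}
    for category, product_id in records:
        distinct.setdefault(category, set()).add(product_id)
    return {category: len(products) for category, products in distinct.items()}


bad_view = buggy_job(RAW)    # {'C1': 3, 'C2': 1}, wrong
good_view = fixed_job(RAW)   # discard the bad view and recompute from the same raw data
print(bad_view, good_view)   # {'C1': 3, 'C2': 1} {'C1': 2, 'C2': 1}
```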

Page 25: Lambda architecture @ Indix

Operational Complexity

● No random writes in the batch layer
  ○ Bulk Updates to build the batch view

Page 26: Lambda architecture @ Indix

Great… What about Latency?

Page 27: Lambda architecture @ Indix

Speed Layer

[Diagram: Recent Data → Queue (Kafka) → Real Time Processing (Storm) → Random Writes (Updates) to a Read-Write Data Store (Riak, HBase, Cassandra) holding HyperLogLog Sets, which answers the Query]
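The speed layer keeps HyperLogLog sets rather than exact sets. The following is a minimal, self-contained HyperLogLog sketch in plain Python to show the idea; it is not the implementation used with Storm at Indix. The precision, the SHA-1 hash and the class name are assumptions, and the accuracy of this toy version will differ from the figures quoted in the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog for approximate distinct counts (illustrative only)."""

    def __init__(self, precision=12):
        self.precision = precision
        self.num_registers = 1 << precision
        self.registers = [0] * self.num_registers

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:8], "big")   # use 64 bits of the hash
        tail_bits = 64 - self.precision
        index = hashed >> tail_bits                  # leading bits select a register
        tail = hashed & ((1 << tail_bits) - 1)       # remaining bits
        rank = tail_bits - tail.bit_length() + 1     # position of the first 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Union of two sketches: keep the larger value of each register."""
        if other.precision != self.precision:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.precision)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        m = self.num_registers
        alpha = 0.7213 / (1 + 1.079 / m)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if estimate <= 2.5 * m and empty > 0:
            return m * math.log(m / empty)           # small-range correction
        return estimate


hll = HyperLogLog()
for i in range(499):
    hll.add(f"product-{i}")
for i in range(499):
    hll.add(f"product-{i}")  # repeated sightings do not change the sketch
print(round(hll.count()))
```

Each update touches one small register, which is why the speed layer can absorb random writes cheaply, and two sketches can be combined with `merge` without revisiting the underlying items.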

Page 28: Lambda architecture @ Indix

Speed Layer has mutation... But

● Speed layer deals with much smaller data
  ○ Batch Layer - Months/years of data
  ○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally

Complexity Isolation

Page 29: Lambda architecture @ Indix

Final Step - Merging Results

[Diagram: Data feeds the Batch Layer and the Speed Layer, and a Query merges their results. Batch Layer: C1 - 50000. Speed Layer: C1 - 499 (approximate, with error 0.02%). Merged Results: C1 - 50499.]
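The merge on this slide is a per-category sum. A small sketch using the slide's numbers follows; `merge_results` is a hypothetical helper. Adding the two counts assumes the speed layer only counts products the batch view has not seen yet; otherwise the underlying sets (or HyperLogLog sketches) would have to be merged instead.

```python
def merge_results(batch_view, speed_view):
    """Combine per-category counts from the batch view and the speed view."""
    merged = dict(batch_view)
    for category, recent_count in speed_view.items():
        merged[category] = merged.get(category, 0) + recent_count
    return merged


batch_view = {"C1": 50000}   # exact, from the last batch run
speed_view = {"C1": 499}     # approximate, from recent data only
merged = merge_results(batch_view, speed_view)
print(merged)  # {'C1': 50499}
```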

Page 30: Lambda architecture @ Indix

What about Accuracy?

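[Diagram: Data feeds the Batch Layer and the Speed Layer, and a Query merges their results. The Speed Layer reports C1 - 499 (approximate, with error 0.02%). The Batch Layer moves from C1 - 50000 to C1’ - 50500, and the Merged Results show C1’ - 50500.]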

Eventually Accurate
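One way to read "eventually accurate": once a batch run has covered a time range, the approximate speed-layer counts for that range are no longer used. A sketch with the slide's totals follows; the hourly bucket split (250 and 249) and the `serve_count` helper are assumptions for illustration.

```python
def serve_count(batch_count, batch_covers_through, speed_buckets):
    """Batch count plus speed-layer counts for hours the batch run has not covered."""
    recent = sum(
        count for hour, count in speed_buckets.items() if hour > batch_covers_through
    )
    return batch_count + recent


# Hypothetical speed-layer buckets: hour of day -> approximate new products.
speed_buckets = {10: 250, 11: 249}

before_batch_run = serve_count(50000, 9, speed_buckets)    # batch covers data through 9
after_batch_run = serve_count(50500, 11, speed_buckets)    # batch has caught up through 11
print(before_batch_run, after_batch_run)  # 50499 50500
```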

Page 31: Lambda architecture @ Indix

Lambda Architecture

Page 32: Lambda architecture @ Indix

Lambda Architecture @ INDIX

Page 33: Lambda architecture @ Indix

Lambda Architecture @ Indix

Page 34: Lambda architecture @ Indix

Batch Layer @ Indix

● Pail
  ○ Vertical partitioning
  ○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
  ○ Bulk updates to create views
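The two Pail bullets can be illustrated with a local-filesystem sketch. This is not Pail's API: the directory layout (one folder per hour), the file naming and the helper functions are all hypothetical. It only shows what vertical partitioning and small-file consolidation mean.

```python
import tempfile
from pathlib import Path


def append_records(root, hour, records):
    """Vertical partitioning: each hour gets its own folder; appends create new files."""
    partition = Path(root) / hour
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"part-{len(list(partition.iterdir())):05d}.txt"
    target.write_text("".join(f"{record}\n" for record in records))
    return target


def read_partition(root, hour):
    """Read every record in one partition, in file order."""
    parts = sorted((Path(root) / hour).glob("part-*.txt"))
    return [line for part in parts for line in part.read_text().splitlines()]


def consolidate(root, hour):
    """Consolidation: merge the small files of one partition into a single file."""
    partition = Path(root) / hour
    lines = read_partition(root, hour)
    for part in partition.glob("part-*.txt"):
        part.unlink()
    merged = partition / "part-00000.txt"
    merged.write_text("".join(f"{line}\n" for line in lines))
    return merged


root = tempfile.mkdtemp()
append_records(root, "09", ["p1", "p2"])
append_records(root, "09", ["p3"])
append_records(root, "10", ["p9"])
before = read_partition(root, "09")
consolidate(root, "09")
after = read_partition(root, "09")
print(before == after, len(list((Path(root) / "09").iterdir())))  # True 1
```

A job that only needs the 09 partition never opens the 10 folder, and after consolidation it opens one file instead of many.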

Page 35: Lambda architecture @ Indix

Speed Layer @ Indix

● Still WIP
● To reduce latency
  ○ Micro batches for Speed layer
  ○ Use the last batch run + bulk update views

Page 36: Lambda architecture @ Indix

Open Challenges

● Managing both Batch & Real Time is still painful
● Two broad directions
  ○ Abstractions
    ■ Summingbird (Twitter)
  ○ Unified Stack
    ■ Spark
    ■ Kafka + Samza/Storm (LinkedIn)
    ■ Cloud Dataflow (Google)

Page 37: Lambda architecture @ Indix

In Conclusion...

● Lambda Architecture
  ○ A different approach to build data systems
  ○ Solid principles
  ○ Domain Agnostic
  ○ Tools not yet mature

Page 39: Lambda architecture @ Indix

Key Takeaways

- Human Fault Tolerance

- Complexity Isolation

- Higher Level Abstractions

Page 40: Lambda architecture @ Indix

Thank You

Page 41: Lambda architecture @ Indix

Batch vs Real Time Choices

Page 42: Lambda architecture @ Indix

Tying it all together - Go-CD

Page 43: Lambda architecture @ Indix

Extras

● Monoids
● LA is not new
  ○ Search Engines (fast, slow crawl)
  ○ Event Sourcing (immutable events to maintain state)
  ○ Patch, Audit, Bootstrap
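The slide only names monoids. One common reading, offered here as an assumption about what the talk meant, is that aggregates with an associative merge and an identity element can be computed in chunks and combined in any grouping, which suits both batch and incremental updates. Sets under union are one such monoid:

```python
from functools import reduce

EMPTY = frozenset()  # identity element: merging with it changes nothing


def merge(left, right):
    """Associative merge: set union."""
    return left | right


# Hypothetical partial aggregates, e.g. product ids seen in three chunks of data.
chunks = [frozenset({"p1", "p2"}), frozenset({"p2", "p3"}), frozenset({"p4"})]

grouped_left = merge(merge(chunks[0], chunks[1]), chunks[2])
grouped_right = merge(chunks[0], merge(chunks[1], chunks[2]))
total = reduce(merge, chunks, EMPTY)
print(grouped_left == grouped_right == total, len(total))  # True 4
```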

Page 44: Lambda architecture @ Indix

Problem Statement - Optimization