Cassandra & Next Generation Analysis: Cassandra for a high-velocity data ingestion and real-time analysis system. Ameet Chaubal & Fausto Inestroza

C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza

DESCRIPTION

The presentation aims to highlight the challenges posed by large-scale, near real-time data processing problems. In the past, such problems were solved using conventional technologies, primarily a database and a JMS queue. However, these solutions had their limits and presented serious problems in terms of scale and redundancy. The new breed of products, such as Cassandra and Kafka, is distributed by design and aims to tackle such challenges elegantly. The presentation will showcase some industry use cases of this kind and describe the solutions, which have been growing in sophistication.

Citation preview

Page 1: C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet Chaubal and Fausto Inestroza

Cassandra & Next Generation Analysis

Cassandra for a high-velocity data ingestion and real-time analysis system.

Ameet Chaubal & Fausto Inestroza

Page 2:

Presentation Route
• Describe conventional technology solution
• Highlight deficiencies
• Showcase new solution implemented using Cassandra
• Lay out architecture with improvements

Page 3:

Business Case
• Capture messages from a high-volume e-Commerce site.
• Store them in a database.
• Perform near real-time queries for troubleshooting.
• Perform deeper analysis a la BI.

Page 4:

Olden Days …

[Diagram: eCommerce Website, JMS Queue, Transient Storage (RDBMS), Data warehouse, Analysis]

Page 5:

Business Case, Details…
Messages: 5000 msg/sec ~ 250 million / day
Message size: 1 Kb

[Diagram: eCommerce Website, JMS Queue, Transient Storage (RDBMS), Data warehouse]
• JMS Queue: decouple UI from storage; multiple sinks
• Transient Storage (RDBMS): dedicated storage; triage
• Data warehouse: data analysis; business intelligence
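The two message figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes that "1 Kb" means roughly 1,000 bytes per message, and it reads 5000 msg/sec as a peak rate; both readings are assumptions, not statements from the slides. Under them, a sustained 5000 msg/sec would exceed 250 million messages per day, so the daily figure corresponds to a lower average rate.

```python
# Back-of-the-envelope sizing from the slide's figures.
# Assumptions (not from the slides): 1 Kb ~ 1,000 bytes, and 5000 msg/sec is a peak.
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_daily_messages(rate_per_sec: int) -> int:
    """Messages per day if the given rate were held for 24 hours."""
    return rate_per_sec * SECONDS_PER_DAY


def average_rate(messages_per_day: int) -> float:
    """Average messages per second implied by a daily total."""
    return messages_per_day / SECONDS_PER_DAY


def daily_volume_gb(messages_per_day: int, message_bytes: int = 1_000) -> float:
    """Raw payload volume per day in gigabytes (10**9 bytes), before replication."""
    return messages_per_day * message_bytes / 1e9


print(sustained_daily_messages(5_000))    # 432000000
print(round(average_rate(250_000_000)))   # 2894
print(daily_volume_gb(250_000_000))       # 250.0
```

On these assumptions the system ingests on the order of 250 GB of raw payload per day, before any replication across data centers.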

Page 6:

What’s the problem?

[Diagram: SITE I and SITE II, each with a JMS Queue and Transient storage; Batch Load into the Data warehouse]
• Queue replication problems
• Message loss
• Other applications affected in case of failover
• Triage data isolated
• No universal view
• Data consolidation adds delay
• Inability to keep up with increasing messages
• Analysis always lagging the action
• No low-latency queries

Page 7:

Problems Recap
• Over 5000 msg/sec: high write speed
• Extraction & load very slow: ETL from transient storage to data warehouse takes over 4 hours
• Analysis always lags events by hours: ETL performed in batches 4 hours apart
• No high availability: no geo-redundancy for transient storage
• Data stored in disparate buckets: no universal view of data for “Triage” applications/troubleshooting
• No dashboard: no low-latency queries
• No immediate alert, pattern detection: no real-time analysis

Page 8:

Solution Blueprint

[Diagram with numbered callouts (A1 … A12). Components shown: Online e-Commerce Application, JMS Publisher, Event JMS, Replication Consumers (Hector / Java Client 1 … n), Load Balancer (VIP), Thrift Connection Pool, Cassandra, Cassandra + Hadoop (Map/Reduce, Hive Queries/BI), Real-Time Dashboard. Arrow labels: “Write event to queue”, “Fetch from queue”.]

Page 9:

Role of Data Model

Before we get there: what features are missing from Cassandra in comparison to a traditional RDBMS?

Page 10:

Shortcomings… Opportunities
• No joins across Column Families
• No analytical functions such as sum, count…
• Difficulty in constructing “WHERE” clause predicates across composite columns
• Inability to order a range of keys with the Random Partitioner

Page 11:

Importance of Data model - Cassandra
• In lieu of JOINS, “smart” de-normalization techniques are crucial.
• Need to use “FEATURES” of Cassandra to effectively model the business rules and business data
• “Client” or “Application” code becomes extremely important.
• “APPLICATION” + “DATABASE” => Full Package

Page 12:

Features of Cassandra Modeling
• “WIDE” Column Family
  – Organize data in “horizontal” as opposed to “vertical” fashion as in RDBMS
• Automatic Sorting of Columns
  – Important to “MODEL” the data in “COLUMNS” as opposed to rows.
• Faster Access to ALL COLUMNS of a Row Key
  – All columns of a row key stored on ONE server => fast iteration/aggregations
• Useful info in “COLUMN NAME”
  – Ground breaking from RDBMS perspective
  – Enables “MORE” “INFORMATION” to be PACKED
  – “COLUMN” as entity becomes “MORE POWERFUL”.
• COMPOSITE Column NAMES:
  – Column names can be COMPOSITES !!! Made up of multiple columns
  – Auto sorting still works

Page 13:

Data Model: wide rows with sharding

Row Key = “<min>|<part#>”

Role of partition #:
• Each row is stored by a single server; with 5,000 x 60 = 300,000 events per minute, a single per-minute row would put a large load on one server for that minute.
• A “partition” contraption aims to “break” this huge row, remove hotspots and spread the load to possibly all servers.
• The # of partitions is some multiple of the # of servers.
• A finite # of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch records for them.
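The sharded row key described above can be sketched as follows. This is a minimal illustration, not the presenters' code: the key layout follows the sample keys shown later in the deck (for example 2012-07-18-08-13-p-1), while `NUM_PARTITIONS`, the 1-based partition numbering, the CRC32-based partition choice and every function name are assumptions made for the example.

```python
import zlib
from datetime import datetime

# Assumption: 12 partitions, e.g. 2 per node on a hypothetical 6-node cluster.
NUM_PARTITIONS = 12


def minute_bucket(ts: datetime) -> str:
    """Truncate a timestamp to the minute, e.g. '2012-07-18-08-13'."""
    return ts.strftime("%Y-%m-%d-%H-%M")


def row_key(ts: datetime, partition: int) -> str:
    """Build the '<min>-p-<part#>' row key for one shard of a minute."""
    return f"{minute_bucket(ts)}-p-{partition}"


def partition_for(event_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a shard (1-based) for an event with a stable hash of its id."""
    return zlib.crc32(event_id.encode("utf-8")) % num_partitions + 1


def keys_for_minute(ts: datetime, num_partitions: int = NUM_PARTITIONS) -> list:
    """Because the partition count is finite, a reader can enumerate every key."""
    return [row_key(ts, p) for p in range(1, num_partitions + 1)]


ts = datetime(2012, 7, 18, 8, 13)
print(row_key(ts, 1))             # 2012-07-18-08-13-p-1
print(len(keys_for_minute(ts)))   # 12
```

A writer would call `partition_for` once per event to spread one minute's 300,000 events over all shards, and a reader would fetch the rows returned by `keys_for_minute` and merge them.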

Page 14:

Composite Columns
• Composite Columns:
  – Actual message stored as part of composite column
• Variable granularity grouping
  – Minute: Row key based on minute

Min_partition (TEXT)    DC:TimeUUID:UserID:Message (Composite)    …
2012-07-18-08-13-p-1    Status
…                       …
2012-07-19-11-21-p-3    Status
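To make the composite column name concrete, the sketch below models DC:TimeUUID:UserID:Message as a tuple and sorts a few names the way a composite comparator would, component by component, with the TimeUUID ordered by its embedded timestamp. It is an in-memory illustration only: `ColumnName`, `time_uuid` and `sort_key` are hypothetical names, and the timestamps are made-up values chosen to show the ordering.

```python
import uuid
from typing import NamedTuple


class ColumnName(NamedTuple):
    """One composite column name: DC:TimeUUID:UserID:Message."""
    dc: str
    time_uuid: uuid.UUID
    user_id: str
    message: str


def time_uuid(timestamp: int, node: int = 0) -> uuid.UUID:
    """Build a version-1 UUID from an explicit 60-bit timestamp (deterministic)."""
    fields = (
        timestamp & 0xFFFFFFFF,       # time_low
        (timestamp >> 32) & 0xFFFF,   # time_mid
        (timestamp >> 48) & 0x0FFF,   # time_hi; version bits are set by version=1
        0,                            # clock_seq_hi_variant
        0,                            # clock_seq_low
        node,
    )
    return uuid.UUID(fields=fields, version=1)


def sort_key(col: ColumnName):
    """Compare component by component; the UUID is ordered by its timestamp."""
    return (col.dc, col.time_uuid.time, col.user_id, col.message)


columns = [
    ColumnName("DC1", time_uuid(200), "u2", "Status"),
    ColumnName("DC2", time_uuid(50), "u3", "Status"),
    ColumnName("DC1", time_uuid(100), "u1", "Status"),
]
print([c.user_id for c in sorted(columns, key=sort_key)])  # ['u1', 'u2', 'u3']
```

With the data center as the leading component, all events from one data center sit next to each other in the row and are time-ordered within it, which is what makes range slices over a minute's row practical.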

Page 15:

Benefits

Page 16:

Geo-Redundancy

[Diagram: Data Center 1 (RW), Data Center 2 (RW), Data Center 3 (RO), Data Center 4 (RO)]

Page 17:

Data Consolidation and Extraction
• Single view of data across multiple locations
• Data extraction can be performed in parallel
• Data extraction process performed in a dedicated cluster of machines.

Page 18:

Low-Latency & Batch Applications
• Triaging
  – Troubleshooting customer issues within 10 minutes of occurrence
  – Feeding a dashboard of live feed data through aggregations performed in Counter CFs
• Analysis
  – Analytical and ad hoc queries to replace the need for remote data warehouse eventually
  – Map/Reduce via Hive without ETL
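The dashboard aggregation idea can be illustrated with an in-memory stand-in for a counter column family: one counter per minute and status, incremented as events arrive and read back by the dashboard. This is a sketch under stated assumptions; `MinuteCounters`, its methods and the (minute, status) keying are hypothetical and do not represent a Cassandra client API.

```python
from collections import Counter
from datetime import datetime


class MinuteCounters:
    """In-memory stand-in for a counter CF keyed by (minute, status)."""

    def __init__(self) -> None:
        self._counts = Counter()

    def increment(self, ts: datetime, status: str, by: int = 1) -> None:
        """Add to the counter for the event's minute bucket and status."""
        self._counts[(ts.strftime("%Y-%m-%d-%H-%M"), status)] += by

    def read(self, minute: str, status: str) -> int:
        """Return the current count; counters never written read as 0."""
        return self._counts[(minute, status)]


counters = MinuteCounters()
counters.increment(datetime(2012, 7, 18, 8, 13, 5), "ERROR")
counters.increment(datetime(2012, 7, 18, 8, 13, 40), "ERROR")
counters.increment(datetime(2012, 7, 18, 8, 14, 1), "OK")
print(counters.read("2012-07-18-08-13", "ERROR"))  # 2
```

Pre-aggregating at write time is what lets a dashboard read a handful of counters instead of scanning the raw events for each refresh.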

Page 19:

Opportunities Remaining
• Near real-time pattern detection and response
• Message loss in JMS queue
• JMS queue replication
• Reducing the impact of queue failover on other applications

Page 20:

Further Improvements…

HOW ???

Page 21:

Accenture Cloud Platform
• Recommender as a Service
• …
• Network Analytics Services

Big Data Platform

Page 22:

Drivers
• Consumer devices
• Video usage

Issues
• Operational costs
• Understanding service quality degradation
• Inefficient capacity planning

Page 23:

[Diagram stages: INGEST, PROCESS, VISUALIZE, ANALYZE, STORE]

Page 24:

WHY STORM?

Page 25:

What do we need?
• Scalability
• Reliability
• Data types, size, velocity
• Mission critical data
• Processing, computation, etc.
• Time series / pattern analysis
• Fault-tolerance
• Multiple use cases

Page 26:

How do we get this from Storm?
• Processing guarantees
• Low-level primitives
• Parallelization
• Robust fail-over strategies
• Scalability
• Reliability
• Fault-tolerance
• Processing, computation, etc.

Page 27:

PRIMITIVES

Page 28:

• Stream: a sequence of tuples (Tuple, Tuple, Tuple), e.g. request info (IP, user-agent, etc.)
• Spout: pull messages from distributed queue
• Bolt: sessionization, speed calculation
• Topology: suboptimal network speed, geospatial analysis
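The four primitives can be mimicked with plain Python generators to show how they relate: a spout yields tuples, a bolt transforms a stream of tuples, and a topology is the wiring between them. This is a conceptual sketch, not Storm code; the comma-separated message format, the field names and the function names are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator


def queue_spout(messages: Iterable[str]) -> Iterator[Dict]:
    """Spout: the source of a stream. Parses 'ip,user_agent,bytes,millis' lines."""
    for line in messages:
        ip, user_agent, nbytes, millis = line.split(",")
        yield {"ip": ip, "user_agent": user_agent,
               "bytes": int(nbytes), "millis": int(millis)}


def speed_bolt(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Bolt: consumes tuples and emits new ones with a computed speed.

    bytes * 8 gives bits; bits per millisecond equals kilobits per second.
    """
    for t in stream:
        yield {**t, "kbps": t["bytes"] * 8 / t["millis"]}


def topology(messages: Iterable[str]) -> Iterator[Dict]:
    """Topology: the graph that wires the spout to the bolt."""
    return speed_bolt(queue_spout(messages))


for result in topology(["10.0.0.1,ua,125000,1000"]):
    print(result["ip"], result["kbps"])  # 10.0.0.1 1000.0
```

In a real cluster each stage runs as many parallel tasks on different machines; the generators here only show the data flow between the primitives.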

Page 29:

Integration with Cassandra

Cassandra
• Optimal for time series data
• Near-linear scalable
• Low read/write latency
• Scales in conjunction with Storm

Custom Bolt
• Uses Hector API to access Cassandra
• Creates dynamic columns per request
• Stores relevant network data

Page 30:

SUBOPTIMAL NETWORK SPEED TOPOLOGY: AN EXAMPLE

Page 31:

[Diagram: Kafka Spout, Pre-process, Sessionize, Calculate N/W Speed per Session, Update Speed per IP, Identify Sub-Optimal Speed, Store in Cassandra, Cassandra. A tuple for ip 1 is shown passing through each stage.]

Page 32:

Parallelism

[Diagram: the same topology (Kafka Spout, Pre-process, Sessionize, Calculate N/W Speed per Session, Update Speed per IP, Identify Sub-Optimal Speed, Store in Cassandra, Cassandra) with tuples for ip 1 and ip 2 flowing through the stages in parallel.]
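One way to picture the parallelism in this topology is a grouping on the IP field: every tuple for a given IP goes to the same task, so per-IP state such as a running average speed can live in that task alone. Storm provides this routing through its fields grouping; the sketch below only imitates the idea with a CRC32 hash, and the task count, field names and function names are assumptions made for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def task_for(ip: str, num_tasks: int) -> int:
    """Route by IP so that one task sees every tuple for that IP."""
    return zlib.crc32(ip.encode("utf-8")) % num_tasks


def update_speed_per_ip(tuples: Iterable[Dict], num_tasks: int) -> List[Dict[str, float]]:
    """Return, per task, the average speed of each IP routed to it."""
    # Each task keeps its own private state: ip -> [count, total_kbps].
    state = [defaultdict(lambda: [0, 0.0]) for _ in range(num_tasks)]
    for t in tuples:
        entry = state[task_for(t["ip"], num_tasks)][t["ip"]]
        entry[0] += 1
        entry[1] += t["kbps"]
    return [{ip: total / n for ip, (n, total) in task.items()} for task in state]


stream = [
    {"ip": "ip1", "kbps": 100.0},
    {"ip": "ip2", "kbps": 50.0},
    {"ip": "ip1", "kbps": 300.0},
]
for task_id, averages in enumerate(update_speed_per_ip(stream, num_tasks=4)):
    print(task_id, averages)
```

Because the routing is deterministic, adding tasks spreads the IPs over more workers without any task needing to coordinate with another.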

Page 33:

Branching and Joins

[Diagram: Stream 1 runs Kafka Spout, Pre-process, Sessionize, Calculate N/W Speed per Session, Update Speed per IP; Stream 2 runs Kafka Spout, Speed by Location. The two streams meet at Join, followed by Compare Speed and Store in Cassandra. Tuples shown: (ip 1), (NY), (ip 1/NY).]
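The join step can be sketched as matching each per-IP speed tuple with the speed recorded for its location and then comparing the two. This is an illustrative sketch only: the field names, the `ratio` threshold of 0.5 and the rule "sub-optimal means below a fraction of the location speed" are assumptions, since the slides do not state how the comparison is made.

```python
from typing import Dict, Iterable, Iterator


def join_and_compare(ip_speeds: Iterable[Dict],
                     location_speeds: Iterable[Dict],
                     ratio: float = 0.5) -> Iterator[Dict]:
    """Join per-IP tuples with per-location tuples on 'location', then compare."""
    # Index stream 2 by location so each stream-1 tuple can be matched in O(1).
    by_location = {t["location"]: t["kbps"] for t in location_speeds}
    for t in ip_speeds:
        baseline = by_location.get(t["location"])
        if baseline is None:
            continue  # no location data yet: nothing to compare against
        yield {**t,
               "location_kbps": baseline,
               "suboptimal": t["kbps"] < ratio * baseline}


ip_stream = [{"ip": "ip1", "location": "NY", "kbps": 400.0}]
location_stream = [{"location": "NY", "kbps": 1000.0}]
for joined in join_and_compare(ip_stream, location_stream):
    print(joined["ip"], joined["location"], joined["suboptimal"])  # ip1 NY True
```

The joined tuple carries both the IP and its location, matching the (ip 1/NY) tuple in the diagram, and is what a final bolt would store in Cassandra.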

Page 34:

Lessons Learned
• Rebalance Topology
• Tweak parallelism in bolt
• Isolation of Topologies
• Use TimeUUIDUtils
• Log4j level set to INFO by default

Page 35:

Thank You

Q & A